Summary: [specweb99] ps wchan broken

Vendor: Red Hat Linux for pSeries
Version: RHEL 3.0
Platform: pSeries
Architecture: PPC-64
Submitting Project: pSeries Performance
Customer Priority: P2
Owning Team: pSeries
OSC Acceptance: N/S
Customer Status: N/S
Required Date: 0000-00-00 00:00:00
Target Date: 2000-00-00 00:00:00
Make External: NO
Status: OPEN
Test Activity: Performance
Reported Phase: Development
Technical Severity: normal
Engineer Priority: P2
Component: Base System
Owner: kaena.com
SubmittedBy: milliner.com
QAContact: olof.com

Hardware Environment: p630 (LER-GQ) with 6 goliad adapters
Software Environment: RHEL4.0

Steps to Reproduce:
1. ps -axln

Actual Results:
ps -axln
  F   UID   PID  PPID   PRI  NI   VSZ  RSS WCHAN  STAT TTY  TIME COMMAND
  4     0     1     0    15   0  1612  584 58308  S    ?    0:07 init [3]
  5     0     2     0  -100   0     0    0 58308  SW   ?    0:00 [swapper]
  5     0     3     0  -100   0     0    0 58308  SW   ?    0:00 [swapper]
  5     0     4     0  -100   0     0    0 58308  SW   ?    0:00 [swapper]
  5     0     5     0  -100   0     0    0 58308  SW   ?    0:00 [swapper]
  1     0     6     1    15   0     0    0 58308  SW   ?    0:00 [keventd]

Expected Results (example):
  F   UID   PID  PPID   PRI  NI   VSZ  RSS WCHAN  STAT TTY  TIME COMMAND
  4     0     1     0    15   0  1472  564 60608  S    ?    3:26 init
  5     0     2     1  -100   0     0    0 4ff58  SW   ?    0:00 [migration/0]
  1     0     3     1    34  19     0    0 5a140  SWN  ?    0:00 [ksoftirqd/0]
  5     0     4     1  -100   0     0    0 4ff58  SW   ?    0:00 [migration/1]
  1     0     5     1    34  19     0    0 5a140  SWN  ?    0:00 [ksoftirqd/1]
  5     0     6     1  -100   0     0    0 4ff58  SW   ?    0:00 [migration/2]
  1     0     7     1    34  19     0    0 5a140  SWN  ?    0:00 [ksoftirqd/2]

Additional Information:
Problem seems to exist on SuSE SLES8 as well.
------- Additional Comments From olof.com(prefers email via olof.com) 2003-21-07 10:08 ------- Nancy, what did you use to produce the correct output in the example above? Other options? Other kernel?
Mark (mdewand) has discovered that the get_wchan kernel code is definitely broken for ppc64. He's working on preparing a fix for this problem.
------- Additional Comments From mahuja.com(prefers email via ahuja.com) 2003-05-08 19:10 ------- Nothing is wrong with the ps code. All processes show the same wchan value in /proc/<pid>/stat. Still investigating.
Investigation found that ppc64 reports the wrong wchan value both in the 'ps' listing and in /proc/<pid>/stat. In every case it reported the address of 'context_switch', which does not match the man page's description of wchan, namely the address of a system call. The problem has been corrected and the change is currently under review by the kernel team.
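For anyone who wants to check the raw value without going through ps, here is a minimal sketch of pulling wchan out of a /proc/<pid>/stat line. It assumes the field layout documented in proc(5), where wchan is field 35; the function name wchan_from_stat is made up for this illustration and is not part of procps or the kernel.

```python
def wchan_from_stat(stat_line: str) -> int:
    """Return the wchan value (field 35, per proc(5)) from one /proc/<pid>/stat line.

    Hypothetical helper for illustration only.
    """
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')' characters, so split only the text after the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 35 sits at index 35 - 3 = 32.
    return int(rest[32])
```

The stat file prints the value in decimal, so format it with "%x" to compare it against the WCHAN column of ps -axln. If every process yields the same number, that matches the symptom described in this bug.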
------- Additional Comments From olof.com(prefers email via olof.com) 2003-06-08 16:20 ------- Julie, can we have a look at the patch as well? Or post it to linuxppc64-dev. Thanks.
Created attachment 93459 [details] Proposed patch for wchan issue in ps listing Currently this patch is under review and may not be the final form.
A fix for this problem (similar to the attached patch) was submitted on 2003-08-09 and will be in the beta2 release.
------- Additional Comments From milliner.com 2003-12-08 14:51 ------- I need a kernel with the wchan fix in it as soon as possible. Can someone get me one?
A kernel with Mark's patch in it should already be available in sushi.
------- Additional Comments From milliner.com 2003-12-08 17:51 ------- Olof built me a kernel with the patch. It is working now. But we can't close out the bug until we receive Beta 2.
You do not need to wait for B2. Sushi will have a fixed kernel soon.
The fixed kernel should be available now.
This fix is somewhat OTT and causes other problems (see bug #142604; LTC12957). The context_switch() function is supposed to be inlined. If it _isn't_ being inlined, we should work out why -- it's only ever invoked once. We should also place its definition inside the 'scheduling_functions_{start,end}_here' markers to prevent it from being reported by get_wchan().
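To illustrate why the markers matter, here is a toy user-space model of the idea, not the kernel's actual get_wchan() code. It assumes get_wchan() walks the sleeping task's return addresses from the innermost frame outwards and reports the first one that falls outside the region bounded by the scheduling-function markers. All addresses and the name get_wchan_model are invented for the example.

```python
def get_wchan_model(return_addresses, sched_start, sched_end):
    """Toy model: first return address outside [sched_start, sched_end), else 0."""
    for addr in return_addresses:  # innermost frame first
        if not (sched_start <= addr < sched_end):
            return addr
    return 0


# Made-up layout: the marker region covers 0x1000..0x2000.
SCHED_START, SCHED_END = 0x1000, 0x2000

# context_switch() emitted out of line at 0x3010, outside the markers,
# called from schedule() at 0x1800, called from a sleeping caller at 0x5000.
stack_outside = [0x3010, 0x1800, 0x5000]

# Same call chain, but with context_switch() placed inside the markers.
stack_inside = [0x1400, 0x1800, 0x5000]
```

With stack_outside the model reports 0x3010, the same value for every sleeping task, which mirrors the identical WCHAN column in the report. With stack_inside it steps over both scheduler frames and reports the caller at 0x5000.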
Created attachment 113305 [details] Simpler patch. Patch to simply move context_switch() inside the markers to make sure get_wchan() will step over it even if the compiler decides not to inline it today.
Created attachment 113388 [details] even simpler patch Even with the previous patch, the compiler emits the context_switch() function in the wrong place. This version attempts to _force_ the compiler to inline it, by using __attribute__((always_inline)).
Patch incorporated into patch test kernels available on: http://people.redhat.com/~dhowells/.pickup/ibm/squadrons/rhel3.shtml
Created attachment 113408 [details] even simpler patch with correct use of inline This patch adds a missing inline directive to the previous patch.
The version of the patch posted by dhowells on 2005-04-20 appears to work correctly for us. Still awaiting confirmation from IBM.
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |olof.com

------- Additional Comments From olof.com(prefers email via olof.com) 2005-05-06 11:57 EDT ------- You just asked people to retest a fix for an 18-month-old bug that's already been fixed once. The test setups that discovered this have long since been taken down and reassigned to other work. Still, given the simple "expected / actual" results, I'm sure we can trust Red Hat's testing to have been sufficient. :-)
I'm happy enough that we're giving correct wchan results -- I'm more interested in the question of whether the saner, simpler fix has eliminated the problems reported in bug #142604 (LTC12957).
------- Additional Comments From olof.com(prefers email via olof.com) 2005-05-06 14:05 EDT ------- Good point. I'll try to find a machine to test on using the testcase I described in that bug. Did you try that yourself? It should be pretty easy to reproduce. I'll try to look at it today.
David Howells was looking at that bug -- I had the impression that he had not been able to reproduce it in-house. I'll double-check with him but he's already gone for the weekend.
David tells me that we were in fact able to reproduce bug #142604 (LTC12957) in-house, but with the new version of the patch we are no longer able to do so. Please confirm.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.4.EL).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html