Bug 99502 - LTC3549 - ps wchan broken
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Howells
QA Contact: Brian Brock
Depends On:
Blocks: 156320
Reported: 2003-07-21 09:01 EDT by Kaena Freitas
Modified: 2007-11-30 17:06 EST (History)

See Also:
Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Last Closed: 2005-09-28 10:18:37 EDT


Attachments
Proposed patch for wchan issue in ps listing (2.05 KB, patch)
2003-08-06 16:45 EDT, Julie DeWandel
Simpler patch. (1.84 KB, patch)
2005-04-17 22:31 EDT, David Woodhouse
even simpler patch (563 bytes, patch)
2005-04-20 02:43 EDT, David Woodhouse
even simpler patch with correct use of inline (570 bytes, patch)
2005-04-20 09:33 EDT, David Howells

Description Kaena Freitas 2003-07-21 09:01:20 EDT
Summary: [specweb99] ps wchan broken
            Vendor: Red Hat Linux for pSeries
           Version: RHEL 3.0
          Platform: pSeries
      Architecture: PPC-64
Submitting Project: pSeries Performance
 Customer Priority: P2
       Owning Team: pSeries
    OSC Acceptance: N/S
   Customer Status: N/S
     Required Date: 0000-00-00 00:00:00
       Target Date: 2000-00-00 00:00:00
     Make External: NO
            Status: OPEN
     Test Activity: Performance
    Reported Phase: Development
Technical Severity: normal
 Engineer Priority: P2
         Component: Base System
             Owner: kaena@us.ibm.com
       SubmittedBy: milliner@us.ibm.com
         QAContact: olof@us.ibm.com


Hardware Environment: p630 (LER-GQ) with 6 goliad adapters

Software Environment: RHEL4.0

Steps to Reproduce:
1. ps -axln

Actual Results: ps -axln
F   UID   PID  PPID PRI  NI   VSZ  RSS  WCHAN STAT TTY        TIME COMMAND
4     0     1     0  15   0  1612  584  58308 S    ?          0:07 init [3]
5     0     2     0 -100  0     0    0  58308 SW   ?          0:00 [swapper]
5     0     3     0 -100  0     0    0  58308 SW   ?          0:00 [swapper]
5     0     4     0 -100  0     0    0  58308 SW   ?          0:00 [swapper]
5     0     5     0 -100  0     0    0  58308 SW   ?          0:00 [swapper]
1     0     6     1  15   0     0    0  58308 SW   ?          0:00 [keventd]

Expected Results: example
F   UID   PID  PPID PRI  NI   VSZ  RSS  WCHAN STAT TTY        TIME COMMAND
4     0     1     0  15   0  1472  564  60608 S    ?          3:26 init
5     0     2     1 -100  0     0    0  4ff58 SW   ?          0:00 [migration/0]
1     0     3     1  34  19     0    0  5a140 SWN  ?          0:00 [ksoftirqd/0]
5     0     4     1 -100  0     0    0  4ff58 SW   ?          0:00 [migration/1]
1     0     5     1  34  19     0    0  5a140 SWN  ?          0:00 [ksoftirqd/1]
5     0     6     1 -100  0     0    0  4ff58 SW   ?          0:00 [migration/2]
1     0     7     1  34  19     0    0  5a140 SWN  ?          0:00 [ksoftirqd/2]


Additional Information: Problem seems to exist on SuSE SLES8 as well.
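
One quick way to spot this symptom in a saved listing is to count the distinct WCHAN values: the broken output above shows a single value for every process, while the expected output shows several. The following is a user-space Python sketch only, not part of any fix. The name distinct_wchans and the two sample constants are hypothetical, and the sample rows are copied from this report.

```python
BROKEN_PS = """
F   UID   PID  PPID PRI  NI   VSZ  RSS  WCHAN STAT TTY        TIME COMMAND
4     0     1     0  15   0  1612  584  58308 S    ?          0:07 init [3]
5     0     2     0 -100  0     0    0  58308 SW   ?          0:00 [swapper]
1     0     6     1  15   0     0    0  58308 SW   ?          0:00 [keventd]
"""

EXPECTED_PS = """
F   UID   PID  PPID PRI  NI   VSZ  RSS  WCHAN STAT TTY        TIME COMMAND
4     0     1     0  15   0  1472  564  60608 S    ?          3:26 init
5     0     2     1 -100  0     0    0  4ff58 SW   ?          0:00 [migration/0]
1     0     3     1  34  19     0    0  5a140 SWN  ?          0:00 [ksoftirqd/0]
"""


def distinct_wchans(ps_output):
    """Return the set of WCHAN strings found in 'ps -axln' style output."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    header, rows = lines[0].split(), lines[1:]
    # Locate the WCHAN column from the header instead of hard-coding it.
    # This works because every column before WCHAN is a single token.
    col = header.index("WCHAN")
    return {row.split()[col] for row in rows}


if __name__ == "__main__":
    print("broken:  ", sorted(distinct_wchans(BROKEN_PS)))
    print("expected:", sorted(distinct_wchans(EXPECTED_PS)))
```

A single distinct value across unrelated processes is only a hint, since an idle system can legitimately have many tasks sleeping in the same place.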
Comment 1 Kaena Freitas 2003-07-21 10:09:28 EDT
------- Additional Comments From olof@us.ibm.com(prefers email via olof@austin.ibm.com)  2003-21-07 10:08 -------
Nancy, what did you use to produce the correct output in the example above?
Other options? Other kernel?
Comment 2 Julie DeWandel 2003-07-23 10:06:40 EDT
Mark (mdewand) has discovered that the get_wchan kernel code is definitely
broken for ppc64. He's working on preparing a fix for this problem.
Comment 3 Olof Johansson 2003-08-05 19:18:44 EDT
------- Additional Comments From mahuja@us.ibm.com(prefers email via ahuja@austin.ibm.com)  2003-05-08 19:10 -------
Nothing wrong with the ps code..
All procs in /proc/pid/stat show the same values..
Still investigating..
Comment 4 Julie DeWandel 2003-08-06 07:25:57 EDT
This problem was investigated and an issue found with the value ppc64 was
reporting in both the 'ps' listing and in /proc/<pid>/stat. In all cases it was
reporting the address of 'context_switch' which does not at all correspond to
what the man page describes the wchan value to be: namely the address of a
system call. The problem has been corrected and the change is currently under
review by the kernel team.
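
To check the value independently of ps, wchan can be read directly from /proc/<pid>/stat, where proc(5) documents it as field 35. The sketch below is a minimal Python example, assuming that field layout. The helper names are hypothetical, and the sample line is synthetic (zeros everywhere except the fields of interest), not real kernel output.

```python
def wchan_from_stat(stat_line):
    """Extract wchan (field 35 per proc(5)) from one /proc/<pid>/stat line."""
    # comm (field 2) is wrapped in parentheses and may itself contain
    # spaces or ')', so cut at the LAST ')'.
    _, _, rest = stat_line.rpartition(")")
    fields = rest.split()
    # fields[0] is field 3 (state), so field 35 sits at index 35 - 3 = 32.
    return int(fields[32])


def synthetic_stat_line(pid, comm, wchan):
    """Hypothetical helper: build a stat-like line for demonstration only."""
    middle = ["0"] * 31  # placeholder values for fields 4..34
    return " ".join([str(pid), "(%s)" % comm, "S"] + middle + [str(wchan)])


if __name__ == "__main__":
    line = synthetic_stat_line(6, "keventd", 0x58308)
    print(hex(wchan_from_stat(line)))
    # On a live system one would read the file instead, for example:
    #   with open("/proc/1/stat") as f:
    #       print(hex(wchan_from_stat(f.read())))
```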
Comment 5 Olof Johansson 2003-08-06 16:26:19 EDT
------- Additional Comments From olof@us.ibm.com(prefers email via olof@austin.ibm.com)  2003-06-08 16:20 -------
Julie, can we have a look at the patch as well? Or post it to linuxppc64-dev.
Thanks.

Comment 6 Julie DeWandel 2003-08-06 16:45:22 EDT
Created attachment 93459
Proposed patch for wchan issue in ps listing

Currently this patch is under review and may not be the final form.
Comment 7 Julie DeWandel 2003-08-11 15:09:58 EDT
A fix for this problem (similar to the attached patch) was submitted on
2003-08-09 and will be in the beta2 release.
Comment 8 Olof Johansson 2003-08-12 15:00:49 EDT
------- Additional Comments From milliner@us.ibm.com  2003-12-08 14:51 -------
I need a kernel with the wchan fix in it as soon as possible.  Can someone get 
me one?

Comment 9 Rik van Riel 2003-08-12 15:09:30 EDT
a kernel with Mark's patch in it should be available in sushi already
Comment 10 Olof Johansson 2003-08-12 17:57:54 EDT
------- Additional Comments From milliner@us.ibm.com  2003-12-08 17:51 -------
Olof built me a kernel with the patch.  It is working now.  But we can't close 
out the bug until we receive Beta 2.
Comment 11 Rik van Riel 2003-08-12 18:02:04 EDT
You do not need to wait for B2.  Sushi will have a fixed kernel soon.
Comment 12 Tim Powers 2003-08-12 19:19:35 EDT
The fixed kernel should be available now.
Comment 13 David Woodhouse 2005-04-17 22:20:34 EDT
This fix is somewhat OTT and causes other problems (see bug #142604; LTC12957). 

The context_switch() function is supposed to be inlined. If it _isn't_ being
inlined, we should work out why -- it's only ever invoked once.

We should also place its definition inside the
'scheduling_functions_{start,end}_here' markers to prevent it from being
reported by get_wchan().  
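
The reasoning above can be modelled in user space. The sketch below is a simplified Python model of the idea behind get_wchan(), not the kernel code: it walks a list of saved return addresses, innermost first, and reports the first one that lies outside the range bounded by the scheduling-function markers. All addresses and the name get_wchan_model are invented for illustration.

```python
SCHED_START = 0x1000  # stands in for scheduling_functions_start_here
SCHED_END = 0x2000    # stands in for scheduling_functions_end_here


def get_wchan_model(return_addresses, start=SCHED_START, end=SCHED_END):
    """Return the first address outside [start, end), or 0 if there is none."""
    for addr in return_addresses:
        if addr < start or addr >= end:
            return addr
    return 0


# Frames of a sleeping task, innermost first:
# context_switch() <- schedule() <- the function that went to sleep.
CALLER = 0x5A140

# context_switch() emitted as a separate function outside the markers.
STACK_OUTSIDE_MARKERS = [0x2410, 0x1200, CALLER]
# context_switch() emitted as a separate function inside the markers.
STACK_INSIDE_MARKERS = [0x1810, 0x1200, CALLER]
# context_switch() inlined into schedule(), so it has no frame of its own.
STACK_INLINED = [0x1200, CALLER]

if __name__ == "__main__":
    for name, stack in [
        ("outside markers", STACK_OUTSIDE_MARKERS),
        ("inside markers", STACK_INSIDE_MARKERS),
        ("inlined", STACK_INLINED),
    ]:
        print(name, hex(get_wchan_model(stack)))
```

In the first case every sleeping task reports the same address inside context_switch(), which matches the identical WCHAN column in the original report. In the other two cases the walk steps over the scheduler and reports the sleeping caller.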
Comment 14 David Woodhouse 2005-04-17 22:31:16 EDT
Created attachment 113305
Simpler patch.

Patch to simply move context_switch() inside the markers to make sure
get_wchan() will step over it even if the compiler decides not to inline it
today.
Comment 15 David Woodhouse 2005-04-20 02:43:58 EDT
Created attachment 113388
even simpler patch

Even with the previous patch, the compiler emits the context_switch() function
in the wrong place. This version attempts to _force_ the compiler to inline it,
by using __attribute__((always_inline)).
Comment 16 David Howells 2005-04-20 09:23:32 EDT
Patch incorporated into patch test kernels available on: 
 
http://people.redhat.com/~dhowells/.pickup/ibm/squadrons/rhel3.shtml 
Comment 17 David Howells 2005-04-20 09:33:59 EDT
Created attachment 113408
even simpler patch with correct use of inline

This patch adds a missing inline directive to the previous patch.
Comment 18 David Woodhouse 2005-05-06 11:40:07 EDT
The version of the patch posted by dhowells on 2005-04-20 appears to work
correctly for us. Still awaiting confirmation from IBM.
Comment 19 IBM Bug Proxy 2005-05-06 12:02:52 EDT
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |olof@us.ibm.com




------- Additional Comments From olof@us.ibm.com(prefers email via olof@austin.ibm.com)  2005-05-06 11:57 EDT -------
You just asked people to retest a fix for an 18 month old bug that's already
been fixed once. Test setups that discovered this have long since been taken
down and reassigned to other work.

Still, given the simple "expected / actual" results, I'm sure we can trust
RedHat's testing to have been sufficient. :-) 
Comment 20 David Woodhouse 2005-05-06 12:15:10 EDT
I'm happy enough that we're giving correct wchan results -- I'm more interested
in the question of whether the saner, simpler fix has eliminated the problems
reported in bug #142604 (LTC12957).
Comment 21 IBM Bug Proxy 2005-05-06 14:08:04 EDT
---- Additional Comments From olof@us.ibm.com(prefers email via olof@austin.ibm.com)  2005-05-06 14:05 EDT -------
Good point. I'll try to find a machine to test on using the testcase I described
in that bug. Did you try that yourself? It should be pretty easy to reproduce.

I'll try to look at it today. 
Comment 22 David Woodhouse 2005-05-06 14:16:36 EDT
David Howells was looking at that bug -- I had the impression that he had not
been able to reproduce it in-house. I'll double-check with him but he's already
gone for the weekend. 
Comment 23 David Woodhouse 2005-05-10 17:22:56 EDT
David tells me that we were in fact able to reproduce bug #142604 (LTC12957)
in-house, but with the new version of the patch we are no longer able to do so.
Please confirm.
Comment 24 Ernie Petrides 2005-05-14 01:11:02 EDT
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.4.EL).
Comment 32 Red Hat Bugzilla 2005-09-28 10:18:37 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html
