Bug 426198

Summary: LTE: Drop into xmon on uli03 during stress test run
Product: Red Hat Enterprise Linux 4
Reporter: IBM Bug Proxy <bugproxy>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Martin Jenner <mjenner>
Severity: medium
Docs Contact:
Priority: low
Version: 4.6
CC: staubach, steved, vgoyal
Target Milestone: rc
Target Release: ---
Hardware: ppc64
OS: All
URL:
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-08-01 16:49:16 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description IBM Bug Proxy 2007-12-19 09:20:20 UTC
NOTE: Please see https://bugzilla.redhat.com/show_bug.cgi?id=402581#c9  by Jeff.
      This should be assigned to him (jlayton). Thanks!

=Comment: #0=================================================
Miao Tao Feng <fengmt.com> - 2007-12-18 22:10 EDT
---Problem Description---
  After the stress tests (BASE, IO, TCP, NFS focus areas) had run for about 30
hours on uli03, the system dropped into xmon, and dmesg contained many error
messages reading "nfs4_reclaim_open_state: unhandled error -116. Zeroing state".

  The kernel on uli03 is 2.6.9-68.1.EL.jtltest.25, a test build carrying a
candidate fix for bug 37993. We obtained it from http://people.redhat.com/jlayton/.

XMON INFO:
3:mon> e
cpu 0x3: Vector: 300 (Data Access) at [c0000000d4e6fbd0]
    pc: d0000000004f7e54: .nfs4_reclaim_open_state+0x144/0x184 [nfs]
    lr: d0000000004f7e3c: .nfs4_reclaim_open_state+0x12c/0x184 [nfs]
    sp: c0000000d4e6fe50
   msr: 8000000000009032
   dar: 100100
 dsisr: 40000000
  current = 0xc0000001d33c35a0
  paca    = 0xc0000000003fc000
    pid   = 10886, comm = 9.3.111.204-rec
3:mon> t
[c0000000d4e6fef0] d0000000004f8014 .reclaimer+0x180/0x2cc [nfs]
[c0000000d4e6ff90] c000000000018e48 .kernel_thread+0x4c/0x6c
3:mon> r
R00 = 0000000000000000   R16 = 0000000000000000
R01 = c0000000d4e6fe50   R17 = 0000000000000000
R02 = d00000000052e3e8   R18 = 0000000000000000
R03 = 0000000000000040   R19 = 0000000000000000
R04 = 8000000000009032   R20 = 0000000000230000
R05 = 0000000000000000   R21 = 0000000000000000
R06 = 0000000000000080   R22 = 00000000001cb800
R07 = 0000000000000000   R23 = 0000000000000000
R08 = 0000000000000018   R24 = c0000000003fa800
R09 = c00000000043aec0   R25 = 0000000001300000
R10 = c00000000117bbd8   R26 = c0000001dd610b50
R11 = c00000000043aec0   R27 = c0000001dd610b00
R12 = 0000000044000028   R28 = c0000001dd62f860
R13 = c0000000003fc000   R29 = 0000000000100100
R14 = 0000000000000000   R30 = d00000000052b3c8
R15 = 0000000000000000   R31 = c000000129de86a0
pc  = d0000000004f7e54 .nfs4_reclaim_open_state+0x144/0x184 [nfs]
lr  = d0000000004f7e3c .nfs4_reclaim_open_state+0x12c/0x184 [nfs]
msr = 8000000000009032   cr  = 24000024
ctr = c00000000005d9b8   xer = 000000000000000e   trap =      300


Machine Type = IVM lpar of P6 blade (JS22)

Contact Information = Miao Tao Feng/fengmt.com

Comment 1 IBM Bug Proxy 2008-01-07 17:40:39 UTC
------- Comment From ffilz.com 2008-01-07 12:39 EDT-------
Jeff, any update on this?

Comment 2 IBM Bug Proxy 2008-01-28 17:48:32 UTC
------- Comment From ffilz.com 2008-01-28 12:40 EDT-------
Since this was found on an unofficial kernel, and the parent bug 37993 is
marked for acceptance into a maintenance release, I am going to reject this for
now. When the maintenance release comes out, if this bug still shows up there,
we can reopen this bug.

Comment 4 Jeff Layton 2008-03-05 14:12:38 UTC
-116 == -ESTALE
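
For reference, this mapping can be confirmed quickly from user space (a minimal
check, assuming a Linux host, where ESTALE is defined as 116 in errno.h):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Confirm that the "unhandled error -116" in the dmesg output is -ESTALE. */
int main(void)
{
        printf("ESTALE = %d (%s)\n", ESTALE, strerror(ESTALE));
        return 0;
}

On Linux this prints "ESTALE = 116" followed by the strerror() text ("Stale file
handle" or "Stale NFS file handle", depending on the glibc version).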

This was seen in the context of the state recovery thread. Work generally gets
queued to that thread when the server returns an error and we need to "reset"
the open/lock state on the file. When we tried to recover the state here, we got
back -ESTALE, so the filehandle was no longer any good. It's possible that the
same issue that caused us to get an ESTALE when trying to recover the state was
what originally caused the state recovery attempt in the first place.
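
To make that concrete, here is a rough user-space sketch of the control flow
being described (not the actual fs/nfs/nfs4state.c code; the struct and helper
names below are invented for illustration). The recovery thread walks the open
states and retries them against the server; any error it has no specific
handling for, such as -ESTALE here, is logged with the "Zeroing state" message
and the state is thrown away:

#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-file open/lock state the client tracks. */
struct open_state {
        const char *path;
        int state_flags;
};

/* Pretend reclaim RPC: returns 0 on success or a negative errno from the
 * server. Here it always fails with -ESTALE to mimic the failure in this bug. */
static int try_reclaim(struct open_state *st)
{
        (void)st;
        return -ESTALE;
}

/* Model of the recovery loop: errors with no specific handling are logged
 * and the corresponding state is zeroed (discarded). */
static void reclaim_open_states(struct open_state *states, int n)
{
        for (int i = 0; i < n; i++) {
                int status = try_reclaim(&states[i]);
                if (status >= 0)
                        continue;
                switch (status) {
                case -EAGAIN:
                        /* a retryable error would be handled differently */
                        break;
                default:
                        printf("nfs4_reclaim_open_state: unhandled error %d. Zeroing state\n",
                               status);
                        states[i].state_flags = 0;
                        break;
                }
        }
}

int main(void)
{
        struct open_state states[] = { { "/mnt/nfs/testfile", 1 } };
        reclaim_open_states(states, 1);
        return 0;
}

In the real client, the interesting question is why the reclaim came back
-ESTALE at all, which is what the questions below are trying to narrow down.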

Did something happen on the server at, or shortly before, the time this
problem occurred?

What kind of server are you testing against, and what sort of tests were you
running?

Do we have a coredump from this testing? If so, that might be a way to gather a
bit more info about what sort of error we got back from the server...


Comment 6 Jeff Layton 2008-08-01 16:49:16 UTC
No new info in several months, and the IT is now closed. Closing this BZ with a
resolution of INSUFFICIENT_DATA.