NOTE: Please see https://bugzilla.redhat.com/show_bug.cgi?id=402581#c9 by Jeff. This should be assigned to him (jlayton). Thanks!

=Comment: #0=================================================
Miao Tao Feng <fengmt.com> - 2007-12-18 22:10 EDT
---Problem Description---
After the stress testing (BASE, IO, TCP, NFS focus areas) ran for about 30 hours on uli03, the system dropped into xmon, and dmesg has a lot of error messages "nfs4_reclaim_open_state: unhandled error -116. Zeroing state".

The kernel on uli03 is 2.6.9-68.1.EL.jtltest.25, which attempts to fix bug 37993. We got it from http://people.redhat.com/jlayton/ .

XMON INFO:
3:mon> e
cpu 0x3: Vector: 300 (Data Access) at [c0000000d4e6fbd0]
    pc: d0000000004f7e54: .nfs4_reclaim_open_state+0x144/0x184 [nfs]
    lr: d0000000004f7e3c: .nfs4_reclaim_open_state+0x12c/0x184 [nfs]
    sp: c0000000d4e6fe50
   msr: 8000000000009032
   dar: 100100
 dsisr: 40000000
  current = 0xc0000001d33c35a0
  paca    = 0xc0000000003fc000
    pid   = 10886, comm = 9.3.111.204-rec
3:mon> t
[c0000000d4e6fef0] d0000000004f8014 .reclaimer+0x180/0x2cc [nfs]
[c0000000d4e6ff90] c000000000018e48 .kernel_thread+0x4c/0x6c
3:mon> r
R00 = 0000000000000000   R16 = 0000000000000000
R01 = c0000000d4e6fe50   R17 = 0000000000000000
R02 = d00000000052e3e8   R18 = 0000000000000000
R03 = 0000000000000040   R19 = 0000000000000000
R04 = 8000000000009032   R20 = 0000000000230000
R05 = 0000000000000000   R21 = 0000000000000000
R06 = 0000000000000080   R22 = 00000000001cb800
R07 = 0000000000000000   R23 = 0000000000000000
R08 = 0000000000000018   R24 = c0000000003fa800
R09 = c00000000043aec0   R25 = 0000000001300000
R10 = c00000000117bbd8   R26 = c0000001dd610b50
R11 = c00000000043aec0   R27 = c0000001dd610b00
R12 = 0000000044000028   R28 = c0000001dd62f860
R13 = c0000000003fc000   R29 = 0000000000100100
R14 = 0000000000000000   R30 = d00000000052b3c8
R15 = 0000000000000000   R31 = c000000129de86a0
pc  = d0000000004f7e54 .nfs4_reclaim_open_state+0x144/0x184 [nfs]
lr  = d0000000004f7e3c .nfs4_reclaim_open_state+0x12c/0x184 [nfs]
msr = 8000000000009032   cr  = 24000024
ctr = c00000000005d9b8   xer = 000000000000000e   trap = 300

Machine Type = IVM lpar of P6 blade (JS22)
Contact Information = Miao Tao Feng/fengmt.com
------- Comment From ffilz.com 2008-01-07 12:39 EDT------- Jeff, any update on this?
------- Comment From ffilz.com 2008-01-28 12:40 EDT------- Since this was found on an unofficial kernel, and the parent bug 37993 is marked for acceptance into a maintenance release, I am going to reject this for now. When the maintenance release comes out, if this bug still shows up there, we can reopen this bug.
-116 == -ESTALE

This was seen in the context of the state recovery thread. Work generally gets queued to that thread when the server returns an error and we need to "reset" the open/lock state on the file. When we tried to recover the state here, we got back -ESTALE, so the filehandle was no longer any good. It's possible that the same issue that caused us to get an ESTALE when trying to recover the state was what originally triggered the state recovery attempt in the first place.

Did something happen on the server at, or maybe a little while before, the time this problem occurred? What kind of server are you testing against, and what sort of tests were you running?

Do we have a coredump from this testing? If so, that might be a way to gather a bit more info about what sort of error we got back from the server...
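
For anyone wanting to double-check the -116 == -ESTALE mapping, here is a minimal sketch using Python's standard errno table. It assumes it is run on a Linux host that uses the generic errno numbering (as x86 and powerpc do); other platforms number ESTALE differently. The helper name decode_kernel_error is hypothetical and exists only for illustration.

```python
import errno
import os


def decode_kernel_error(code):
    """Map a kernel-style negative return code to (symbolic name, message).

    Hypothetical helper for illustration only. The result depends on the
    errno table of the host it runs on, so run it on Linux.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


# Decode the value from "unhandled error -116" in the dmesg output.
name, message = decode_kernel_error(-116)
print("-116 -> -%s: %s" % (name, message))
```

The exact message text printed for ESTALE varies between C library versions, so only the symbolic name should be relied on.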
No info in several months and IT is now closed. Closing BZ with resolution of INSUFFICIENT_DATA.