Bug 860399

Summary: RHEL6.3: Oops in rpciod, RIP rpcauth_refreshcred
Product: Red Hat Enterprise Linux 6
Reporter: Kelsey Cummings <kgc>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.3
CC: dwysocha, jlayton, nfs-maint, steved
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-04 20:34:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Kelsey Cummings 2012-09-25 18:10:34 UTC
Description of problem:

Kernel Oops in rpciod under both 2.6.32-279.5.1.el6.x86_64 and 2.6.32-279.5.2.el6.x86_64.  The system is a dovecot mail server handling about 3,500 connections.  At the time of the most recent crash, on 2.6.32-279.5.2, NFS ops/sec had spiked to 750 against a nominal rate of 300 due to load shifting in the cluster.

How reproducible:

Unknown, likely load related.

Backtrace from 2.6.32-279.5.1 follows; the 2.6.32-279.5.2 crash is also at rpcauth_refreshcred+158.

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.5.1.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 16
        DATE: Mon Sep  3 10:07:17 2012
      UPTIME: 10 days, 12:45:19
LOAD AVERAGE: 5.83, 5.07, 4.54
       TASKS: 3392
    NODENAME: x.x.sonic.net
     RELEASE: 2.6.32-279.5.1.el6.x86_64
     VERSION: #1 SMP Tue Aug 14 16:11:42 CDT 2012
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 32 GB
       PANIC: ""
         PID: 1928
     COMMAND: "rpciod/9"
        TASK: ffff880432adb540  [THREAD_INFO: ffff880430af2000]
         CPU: 9
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 1928   TASK: ffff880432adb540  CPU: 9   COMMAND: "rpciod/9"
 #0 [ffff880430af3ae0] machine_kexec at ffffffff8103281b
 #1 [ffff880430af3b40] crash_kexec at ffffffff810ba792
 #2 [ffff880430af3c10] oops_end at ffffffff815013c0
 #3 [ffff880430af3c40] die at ffffffff8100f26b
 #4 [ffff880430af3c70] do_general_protection at ffffffff81500f52
 #5 [ffff880430af3ca0] general_protection at ffffffff81500725
    [exception RIP: rpcauth_refreshcred+158]
    RIP: ffffffffa029a85e  RSP: ffff880430af3d50  RFLAGS: 00010286
    RAX: 6e6967756c705f66  RBX: ffff88032977c6c8  RCX: ffff8804318e1800
    RDX: 0000000000000001  RSI: ffff88011922d3c0  RDI: ffff88032977c6c8
    RBP: ffff880430af3d90   R8: ffff880432b136b8   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000001
    R13: ffff880432b13600  R14: 0000000000000001  R15: ffffffffa028db80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff880430af3d98] call_refresh at ffffffffa028dbc0 [sunrpc]
 #7 [ffff880430af3db8] __rpc_execute at ffffffffa0298e37 [sunrpc]
 #8 [ffff880430af3e28] rpc_async_schedule at ffffffffa02991c5 [sunrpc]
 #9 [ffff880430af3e38] worker_thread at ffffffff8108c760
#10 [ffff880430af3ee8] kthread at ffffffff81091d66
#11 [ffff880430af3f48] kernel_thread at ffffffff8100c14a
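
The exception in frame #5 is a general protection fault rather than a page fault. On x86_64 that is what a memory access through a non-canonical address raises, so one quick check is whether any register in the exception frame holds such a value. The following is a minimal sketch, assuming 48-bit virtual addressing; the helper names is_canonical and non_canonical are made up for illustration, and without a disassembly of rpcauth_refreshcred+158 it cannot say which register was actually dereferenced.

```python
# Register values copied from the exception frame in the backtrace above.
REGISTERS = {
    "RAX": 0x6E6967756C705F66,
    "RBX": 0xFFFF88032977C6C8,
    "RCX": 0xFFFF8804318E1800,
    "RSI": 0xFFFF88011922D3C0,
    "RDI": 0xFFFF88032977C6C8,
    "R8": 0xFFFF880432B136B8,
    "R13": 0xFFFF880432B13600,
    "R15": 0xFFFFFFFFA028DB80,
}


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of addr are all equal.

    va_bits=48 is an assumption (4-level paging).
    """
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def non_canonical(regs: dict) -> list:
    """Return the sorted names of registers holding non-canonical values."""
    return sorted(name for name, value in regs.items() if not is_canonical(value))


if __name__ == "__main__":
    for name in non_canonical(REGISTERS):
        print(f"{name} = {REGISTERS[name]:#018x} is not a canonical address")
```

Of the registers listed, only RAX fails the check, which makes it the most likely candidate for the bad pointer.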

Comment 2 Jeff Layton 2012-10-12 10:31:29 UTC
Would it be possible for you to open a support case with RH support and supply them with the core for analysis?

Comment 3 Jeff Layton 2012-10-12 17:01:27 UTC
(pasting from what was sent via email)

> > Would it be possible for you to open a support case with RH support and supply
> > them with the core for analysis?
> 
> No, but I'd be happy to supply you with the two crash dumps and/or
> perform any analysis for you locally.  New support policies make 
> community reported bugs more challenging, eh?
> 

Yep, if you don't have a support contract then you'll need to do some
legwork on your own.

You'll want to track down the place where it crashed and see if you can determine why. Most likely, there's a corrupt pointer someplace that we ended up trying to chase. See if you can determine what was corrupt and the nature of that corruption...
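
One cheap first step when looking at the nature of the corruption is to check whether a suspicious register value is really text. The sketch below decodes the RAX value from the exception frame in the description as little-endian bytes; the helper name as_ascii is made up for illustration, and treating RAX as the corrupt pointer is an assumption, since the report includes no disassembly of the faulting instruction.

```python
def as_ascii(value: int, width: int = 8):
    """Decode value as little-endian bytes.

    Returns the text if every byte is printable ASCII, otherwise None.
    """
    raw = value.to_bytes(width, "little")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


# RAX from the exception frame in the description.
RAX = 0x6E6967756C705F66
# RBX from the same frame, for comparison: an ordinary kernel pointer.
RBX = 0xFFFF88032977C6C8

if __name__ == "__main__":
    print("RAX:", repr(as_ascii(RAX)))  # prints RAX: 'f_plugin'
    print("RBX:", repr(as_ascii(RBX)))  # prints RBX: None
```

A value that decodes cleanly to eight printable characters looks like a fragment of a string rather than a pointer. That would be consistent with the memory holding the pointer having been overwritten by, or reused for, string data, although this alone does not show where the string came from.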

Comment 4 Jeff Layton 2012-12-04 20:34:25 UTC
It's likely that this bug is a duplicate of bug 878204. Unfortunately, that bug is marked private and I can't add you to the cc list.

You may want to try pulling in commit a271c5a0de from upstream kernels and see if that fixes the issue for you. If it does, please note it here and I'll close this bug as a duplicate of that one.

I'm going to go ahead and close this bug as a dup of that one. If you find that that commit doesn't help, then please reopen this bug and I'll try to take another look.

*** This bug has been marked as a duplicate of bug 878204 ***