We are receiving kernel panics on two production servers. I've reproduced the problem on identical hardware in our test lab. These particular servers are running a Java SE application handling UDP packets (specifically, snmp traps and radius requests). They have a single NFSv4 mount point. As strange as it seems, identical hardware without the java application runs fine and has never crashed. Removing the NFS mountpoint from the production servers also eliminates the problem. It appears that the kernel panics with the combination of Java application + NFSv4 mount. In the test lab, I can crash both boxes within 24 hours. The production servers die much less often (perhaps once a week). I am attaching the kernel panic, and will open a case with RH Support.
Created attachment 319358 [details] Kernel panic output
Created attachment 319455 [details] Output of crash report tool
An update... I can now reliably reproduce this in the lab. Per Kostas Georgiou's observation in bug #209887, it definitely has something to do with the kerb ticket timing out. If I set my ticket lifetime to 30 minutes and fire off a script that is constantly writing to an NFS share... the kernel panics after 30 minutes. Removing the suspect Java process appears to have no effect. I'm not sure why I'm only seeing the problem on production servers running Java, this could be a fluke.
<steved> normelton: what are you using to reproduce it?
<normelton> i'm logging in locally to the box, which grabs a 30 min ticket, then doing repeated dd's to the NFS share
<normelton> write a 1GB file, delete it, sleep 5 seconds, repeat
<normelton> the servers crash like clockwork :)
<steved> normelton: every 30mins (i.e. every time the ticket expires)?
<normelton> i just turned on rpc.gssd logging
<normelton> steved: exactly... ticket expires, box dies within a few minutes
Created attachment 319596 [details]
rpc.gssd log

This is the rpc.gssd log file at the time of the crash.
Simply logging in without doing any continuous NFS activity does not cause a crash.
It is interesting that both callers of __gss_find_upcall, gss_add_msg and gss_pipe_downcall, do take the gss_auth->lock spinlock. I wonder if this is caused by some use-after-free (when an upcall gets recycled but some code still had a pointer to it from another incarnation) or some other kind of upcall member manipulation outside the spinlock. gss_add_msg is an inline, which accounts for the crash inside gss_setup_upcall in IT 230957.
Has there been any progress on this bug? I know there has been some work on NFSv4 in later kernels. Would it be worth trying to reproduce this with the latest Fedora kernels? I can work on that if it would help. If the problem doesn't appear there, it seems like it would be fairly simple to backport the relevant patches. Unfortunately, that's a little beyond my expertise.
Created attachment 322635 [details]
Calltrace systemtap to inspect auth_rpcgss and rpcsec_gss_krb5

Norman,

If you are comfortable using systemtap, can you please load the probes attached here and reproduce the bug? It will generate a call trace of any thread that calls into the auth_rpcgss module, and will let us know the call chains that lead to the problem. It is a bit of an extreme probe, but if we can get it to work it should really tell us what the processes are doing, and if there is a race somewhere we should see it.

Thanks!
Fabio Olive
Fabio,

Systemtap can only hang on for a short while when I'm pushing lots of I/O to NFS. After about 10-15 seconds, it dies with one of the following errors:

ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
WARNING: There were 81341 transport failures.

If you have any suggestions, I can certainly try anything. Sorry it wasn't more help!

Norman
You could take out the .return probes:

probe module("auth_rpcgss").function("*").call,
      module("rpcsec_gss_krb5").function("*").call
{
        printf("%s %s\n", probefunc(), $$parms)
}

You could run stap with "-DSTP_NO_OVERLOAD" and/or "-s 256".
Created attachment 322743 [details]
Last 10K lines of systemtap output

Per Frank's suggestions, I ran with STP_NO_OVERLOAD and have attached the last 10K lines from my systemtap output. I'm setting up a FC10 box now to see if I can reproduce the problem with a newer kernel.

Norman
I am pretty sure that the problem was fixed by kernel-2.6.20-1.2925.fc6. Since that upgrade I haven't seen the bug again up to F9. From what I remember there were some changes between 2.6.18 and 2.6.20 in the relevant code.
From our local reproducer, these are the contents of the struct gss_auth that __gss_find_upcall was iterating over:

crash> struct gss_auth 0xffffffff881c36a0
struct gss_auth {
  rpc_auth = {
    au_cslack = 0x1d,
    au_rslack = 0x43,
    au_verfsize = 0xc,
    au_flags = 0x0,
    au_ops = 0xffffffff881c3600,
    au_flavor = 0x1,
    au_count = {
      counter = 0x1
    },
    au_credcache = 0xffffffff881c36e0
  },
  mech = 0x0,
  service = 0,
  upcalls = {
    next = 0x0,
    prev = 0x0
  },
  client = 0x0,
  dentry = 0x0,
  path = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000>\234<FF><FF>\000\000\000",
  lock = {
    raw_lock = {
      slock = 0x3a97
    }
  }
}

It looks severely broken: the upcalls list is NULL and the lock value looks completely corrupt. Trying to find what could corrupt the struct gss_auth like that. It reproduced even on a uni-processor Xen machine.

Cheers,
Fabio Olive
Kostas,

Interesting... if you really haven't seen this issue since 2.6.20, it limits our search space a little.

$ git log --pretty=oneline v2.6.18..v2.6.20 net/sunrpc | wc -l
133

This one looks interesting:

bde8f00ce64d9824a4f227c8594e335a1a10d044 [PATCH] NFS: Fix Oops in rpc_call_sync()

It could result in one task remaining at work on an RPC task that was just released by another thread, causing it to dereference cleaned-up fields such as cl_auth->upcalls.
Actually, spoke a little too soon. 2.6.18-92.el5 has that.
I tried to reproduce this with F9 and this panic is not reproducible there. When tickets are expired, writing to the nfs mount simply gets stopped without any panic. Linux hostname 2.6.25-2.fc9.x86_64.xen #1 Fri Apr 25 13:13:13 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux So the fix is already upstream, I think.
Any update on this bug? If there's any chance of getting a test kernel or a hotfix, I'd be willing to try it out! Thanks Norman
Regarding comment #18, did you happen to have memory poisoning enabled when you tested that? It would be interesting to run the same reproducer on a kernel-debug, that might give a better indication of what's actually happening to the corrupt gss_auth struct.
FWIW, I tried reproducing this:

1) Set up a client to mount a krb5 mount
2) had an unprivileged user get a krb5 ticket with a 5m timeout
3) dd'ed 1GB of /dev/zero to a file on the mount as that user
4) deleted the file
5) slept 5s and goto 3

...the box did not crash. After the ticket expired, I got:

rm: cannot remove `/mnt/krb5/testuser/testfile': Permission denied
dd: opening `/mnt/krb5/testuser/testfile': Permission denied

...and messages like this in the ring buffer:

call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
RPC call_verify: retry failed, exit EIO
call_verify: auth check failed
RPC call_verify: retry failed, exit EIO
call_verify: auth check failed
call_verify: auth check failed
RPC call_verify: retry failed, exit EIO
call_verify: auth check failed
RPC call_verify: retry failed, exit EIO

...this host was running kernel-debug-2.6.18-128.el5.jtltest.57.x86_64 from my people.redhat.com page. Is there something else I should be doing to reproduce this? If that sounds like how you're reproducing it, it might be interesting to try a 5.3-ish kernel and see if this is still reproducible there.
I don't see any evidence of poisoning patterns in the corrupt gss_auth struct. I think we need to step back and see whether some earlier pointers in the chain might be bad as well. The structs involved here are pretty convoluted, but the rpc_task struct looks fine:

crash> struct rpc_task.tk_magic ffff81001656fe20
  tk_magic = 0xf00baa,

The rpc_clnt attached to that struct also looks OK AFAICT.

...and the rpc_auth struct also looks valid:

crash> struct rpc_auth 0xffffffff88399520
struct rpc_auth {
  au_cslack = 0x1d,
  au_rslack = 0x3f,
  au_verfsize = 0xc,
  au_flags = 0x0,
  au_ops = 0xffffffff88399480,
  au_flavor = 0x1,
  au_count = {
    counter = 0x1
  },
  au_credcache = 0xffffffff88399560
}

...but the au_flavor is RPC_AUTH_UNIX. I think it should be RPC_AUTH_GSS (0x6) instead. The gss_auth struct we're working with is determined from the container of this struct:

    struct gss_auth *gss_auth = container_of(task->tk_client->cl_auth,
                                             struct gss_auth, rpc_auth);

...so it doesn't seem too surprising that it looks bogus once we start looking at it. It might be interesting to add some BUG_ON's to sanity check the au_flavor.
RPC credops attached to tk_msg are definitely AUTH_GSS ones:

crash> struct rpc_credops 0xffffffff88406b80
struct rpc_credops {
  cr_name = 0xffffffff88401e92 "AUTH_GSS",
  cr_init = 0xffffffff883fd759 <gss_cred_init>,
  crdestroy = 0xffffffff883fdc21 <gss_destroy_cred>,
  crmatch = 0xffffffff883fd000 <init_rpcsec_gss>,
  crmarshal = 0xffffffff883fdc5d <gss_marshal>,
  crrefresh = 0xffffffff883fd8ad <gss_refresh>,
  crvalidate = 0xffffffff883fddef <gss_validate>,
  crwrap_req = 0xffffffff883fdf47 <gss_wrap_req>,
  crunwrap_resp = 0xffffffff883fd284 <gss_unwrap_resp>
}

...so to summarize, we have:

    rpc_task.tk_msg.rpc_cred.cr_ops, which points to auth_gss ops

...and...

    rpc_task->tk_client->cl_auth, which points to an auth_unix rpc_auth struct

...I think this is the crux of the problem but am not yet sure how it ends up this way.
Jeff,

Earlier you mentioned this patch:

SUNRPC: Ensure that the struct gss_auth lifetime exceeds the credential's

Maybe the auth_gss structure is being cleaned up and reused as an auth_unix struct while rpc_task->tk_client still had a pointer to it, trusting the old values? I do not know if the auth_* structures are allocated out of a common cache or not, but it does look like use-after-free, especially when there is a probable refcount/lifetime issue in the code.

Cheers,
Fabio Olive
The cl_auth gets set in rpcauth_create. What seems to be happening is that we're creating 2 rpc_auth structs for 2 different tk_client's. When the delayed work to renew the cred is being done, it looks like it has the "wrong" tk_client.

Here's a bit of debugging from rpcauth_create that shows the pseudoflavor, rpc_clnt address and rpc_auth addr, plus a stack trace of how it got there. Given that this should be a krb5p mount, it's not clear to me why we're creating an rpc_clnt with a pseudoflavor of 1 here.

rpcauth_create: pseudoflavor=1 rpc_clnt=ffff8100115f1118 auth=ffffffff88394820

Call Trace:
 [<ffffffff88379ad2>] :sunrpc:rpcauth_create+0x8d/0x95
 [<ffffffff88374235>] :sunrpc:rpc_new_client+0x1d9/0x299
 [<ffffffff80159cc3>] __spin_lock_init+0x29/0x47
 [<ffffffff8837464d>] :sunrpc:rpc_create_client+0xb/0x3c
 [<ffffffff884c4ba0>] :nfs:nfs_create_rpc_client+0x11b/0x15c
 [<ffffffff884c5182>] :nfs:nfs4_set_client+0x9f/0x174
 [<ffffffff884c58d7>] :nfs:nfs4_create_server+0x7a/0x394
 [<ffffffff884cc778>] :nfs:nfs_copy_user_string+0x3c/0x89
 [<ffffffff80013216>] poison_obj+0x26/0x2f
 [<ffffffff884cc778>] :nfs:nfs_copy_user_string+0x3c/0x89
 [<ffffffff8000cc34>] cache_alloc_debugcheck_after+0xef/0x1c1
 [<ffffffff800e2443>] __kmalloc+0x11d/0x129
 [<ffffffff884cd0d4>] :nfs:nfs4_get_sb+0x1fc/0x323
 [<ffffffff800d33fe>] zone_statistics+0x3e/0x6d
 [<ffffffff8000ac30>] get_page_from_freelist+0x3ad/0x450
 [<ffffffff800eaeb2>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800eaf7b>] do_kern_mount+0x36/0x4d
 [<ffffffff800f5311>] do_mount+0x6af/0x71f
 [<ffffffff800f3ed3>] copy_mount_options+0xce/0x127
 [<ffffffff800a0cbb>] search_exception_tables+0x1d/0x2d
 [<ffffffff8006acce>] do_page_fault+0x650/0x835
 [<ffffffff800d33fe>] zone_statistics+0x3e/0x6d
 [<ffffffff8000aab5>] get_page_from_freelist+0x232/0x450
 [<ffffffff8000f9c1>] __alloc_pages+0x76/0x2df
 [<ffffffff8004e5b9>] sys_mount+0x8a/0xcd
 [<ffffffff800602a6>] tracesys+0xd5/0xdf

rpcauth_create: pseudoflavor=390005 rpc_clnt=ffff8100115f1330 auth=ffff810017df0b60
Call Trace:
 [<ffffffff88379ad2>] :sunrpc:rpcauth_create+0x8d/0x95
 [<ffffffff884c4332>] :nfs:nfs_init_server_rpcclient+0x5e/0xc0
 [<ffffffff884c5a1e>] :nfs:nfs4_create_server+0x1c1/0x394
 [<ffffffff884cc778>] :nfs:nfs_copy_user_string+0x3c/0x89
 [<ffffffff80013216>] poison_obj+0x26/0x2f
 [<ffffffff884cc778>] :nfs:nfs_copy_user_string+0x3c/0x89
 [<ffffffff8000cc34>] cache_alloc_debugcheck_after+0xef/0x1c1
 [<ffffffff800e2443>] __kmalloc+0x11d/0x129
 [<ffffffff884cd0d4>] :nfs:nfs4_get_sb+0x1fc/0x323
 [<ffffffff800d33fe>] zone_statistics+0x3e/0x6d
 [<ffffffff8000ac30>] get_page_from_freelist+0x3ad/0x450
 [<ffffffff800eaeb2>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800eaf7b>] do_kern_mount+0x36/0x4d
 [<ffffffff800f5311>] do_mount+0x6af/0x71f
 [<ffffffff800f3ed3>] copy_mount_options+0xce/0x127
 [<ffffffff800a0cbb>] search_exception_tables+0x1d/0x2d
 [<ffffffff8006acce>] do_page_fault+0x650/0x835
 [<ffffffff800d33fe>] zone_statistics+0x3e/0x6d
 [<ffffffff8000aab5>] get_page_from_freelist+0x232/0x450
 [<ffffffff8000f9c1>] __alloc_pages+0x76/0x2df
 [<ffffffff8004e5b9>] sys_mount+0x8a/0xcd
 [<ffffffff800602a6>] tracesys+0xd5/0xdf
I don't think that's the case. I instrumented rpcauth_create and see these calls:

rpcauth_create: pseudoflavor=1 rpc_clnt=ffff8100115f1978 auth=ffffffff88394820
rpcauth_create: pseudoflavor=1 rpc_clnt=ffff8100115f1118 auth=ffffffff88394820
rpcauth_create: pseudoflavor=1 rpc_clnt=ffff8100115f1118 auth=ffffffff88394820
rpcauth_create: pseudoflavor=390005 rpc_clnt=ffff8100115f1330 auth=ffff810017df0b60

...the problem with that theory is that I'm not seeing that auth struct (pseudoflavor 390005) get reallocated as an AUTH_UNIX struct. I don't think this is a use-after-free problem.
To summarize the reproducer:

1) mount up an export on a RHEL5 client using sec=krb5p (regular krb5 or krb5i may also work, I'm not sure)
2) become an unprivileged user that has write permissions on the mount
3) get a krb5 ticket as that user with a 10m timeout:

$ kinit -l 10m

4) run this loop as the unprivileged user (substitute paths where appropriate):

while true; do
    dd if=/dev/zero of=/mnt/krb5/testuser/testfile bs=1MB count=512
    rm -f /mnt/krb5/testuser/testfile
    sleep 5
done

...eventually the krb5 ticket will time out and a little while later the box will crash. I've tried using shorter krb5 ticket timeouts but it didn't seem to reproduce, so there is some timing involved.
A little more info. I did a bit more instrumentation and got a new core. This core panicked with rpc_task == 0xffff8100131c7060. That struct showed the now-familiar mismatch of AUTH_GSS cred ops with an AUTH_UNIX client. Some debug messages:

rpcauth_bindcred: task=ffff8100131c7060 au_name=UNIX cr_name=AUTH_UNIX tk_client=ffff81000a7b22b0
rpcauth_bindcred: task=ffff8100131c7060 au_name=UNIX cr_name=AUTH_UNIX tk_client=ffff81000a7b2098
rpcauth_bindcred: task=ffff8100131c7060 au_name=RPCSEC_GSS cr_name=AUTH_GSS tk_client=ffff81000bb5dd68

I think this means that the rpc_task was recycled a couple of times. When rpcauth_bindcred was called, tk_client was at 0xffff81000bb5dd68, but in the core, it's set to 0xffff81000a7b2098. It looks like the tk_client gets set in rpc_init_task, so I'm going to instrument that next...
This problem is also reproducible with sec=krb5.
The interesting bit is that the stack trace is always in the state renewal code:

Process events/0 (pid: 8, threadinfo ffff81001fccc000, task ffff81001f7fe240)
Stack:  ffff81001fba12f0 ffffffff883fb6f0 ffff81001a3b6858 ffff81001a3b6858
        ffff81001367e060 ffff81001a3b6858 ffffffff88397520 ffffffff883fb927
        ffff81001367e060 ffff81001a3b6858 0000000000000000 ffffffff884ed000
Call Trace:
 [<ffffffff883fb6f0>] :auth_rpcgss:gss_setup_upcall+0x93/0xfc
 [<ffffffff883fb927>] :auth_rpcgss:gss_refresh+0x7a/0x139
 [<ffffffff883794eb>] :sunrpc:rpcauth_refreshcred+0x48/0x53
 [<ffffffff88378d82>] :sunrpc:__rpc_execute+0x92/0x250
 [<ffffffff88373bb0>] :sunrpc:rpc_call_async+0x6f/0xa3
 [<ffffffff884e02e0>] :nfs:nfs4_proc_async_renew+0x45/0x4a
 [<ffffffff884e8728>] :nfs:nfs4_renew_state+0xda/0x1c2
 [<ffffffff8004f86f>] run_workqueue+0x9a/0xf4
 [<ffffffff8004c087>] worker_thread+0x0/0x122
 [<ffffffff8004c177>] worker_thread+0xf0/0x122
 [<ffffffff8008ebaf>] default_wake_function+0x0/0xe
 [<ffffffff800a7008>] trace_hardirqs_on+0x11b/0x13f
 [<ffffffff80034306>] kthread+0xfe/0x132
 [<ffffffff8006808c>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff8006898d>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff80034208>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11

nfs4_renew_state calls this:

    cred = nfs4_get_renew_cred(clp);

...and then this:

    nfs4_proc_async_renew(clp, cred);

...nfs4_proc_async_renew then sets up an rpc call with the cred that it found and the rpc_clnt attached to the nfs_client. The mismatch comes in at this point, I think. It seems like nfs4_get_renew_cred is returning GSSAPI credentials even though the rpc_clnt has AUTH_UNIX ops.
Created attachment 328546 [details]
patch -- create rpc clients with proper auth flavor

This patch seems to fix the problem and loosely matches how this code was changed upstream. It seems to work and I haven't seen any regressions yet, but it needs more testing. I'm not terribly familiar with the intricacies of the RPC auth code. Comments appreciated...
If you have a test kernel, I'd be happy to try it out on our end. Thanks for your help!
The latest RHEL5 test kernels on my people.redhat.com page should have this patch: http://people.redhat.com/jlayton If you're able to test them someplace non-critical, then please report back whether they fix the problem...
Good news! It works like a champ. Thanks for your help on this.
Thanks for testing it. I still want to do a bit more testing, but I'm 90% certain that it's good at this point.
in kernel-2.6.18-129.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
Updating PM score.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html