We are receiving kernel panics on two production servers. I've reproduced the problem on identical hardware in our test lab. These particular servers are running a Java SE application handling UDP packets (specifically, snmp traps and radius requests). They have a single NFSv4 mount point. As strange as it seems, identical hardware without the java application runs fine and has never crashed. Removing the NFS mountpoint from the production servers also eliminates the problem. It appears that the kernel panics with the combination of Java application + NFSv4 mount. In the test lab, I can crash both boxes within 24 hours. The production servers die much less often (perhaps once a week). I am attaching the kernel panic, and will open a case with RH Support.
Created attachment 319358 [details] Kernel panic output
Created attachment 319455 [details] Output of crash report tool
An update... I can now reliably reproduce this in the lab. Per Kostas Georgiou's observation in bug #209887, it definitely has something to do with the kerb ticket timing out. If I set my ticket lifetime to 30 minutes and fire off a script that is constantly writing to an NFS share... the kernel panics after 30 minutes. Removing the suspect Java process appears to have no effect. I'm not sure why I'm only seeing the problem on production servers running Java, this could be a fluke.
<steved> normelton: what are you using to reproduce it?
<normelton> i'm logging in locally to the box, which grabs a 30 min ticket, then doing repeated dd's to the NFS share
<normelton> write a 1GB file, delete it, sleep 5 seconds, repeat
<normelton> the servers crash like clockwork :)
<steved> normelton: every 30mins (i.e. every time the ticket expires)?
<normelton> i just turned on rpc.gssd logging
<normelton> steved: exactly... ticket expires, box dies within a few minutes
Created attachment 319596 [details]
rpc.gssd log

This is the rpc.gssd log file at the time of the crash.
Simply logging in without doing any continuous NFS activity does not cause a crash.
It is interesting that both callers of __gss_find_upcall, gss_add_msg and gss_pipe_downcall, do take the gss_auth->lock spinlock. I wonder if this is caused by some use-after-free (when an upcall gets recycled but some code still had a pointer to it from another incarnation) or some other kind of upcall member manipulation outside the spinlock. gss_add_msg is an inline, which accounts for the crash inside gss_setup_upcall in IT 230957.
Has there been any progress on this bug? I know there has been some work on NFSv4 in later kernels. Would it be worth trying to reproduce this with the latest Fedora kernels? I can work on that if it would help. If the problem doesn't appear there, it seems like it would be fairly simple to backport the relevant patches. Unfortunately, that's a little beyond my expertise.
Created attachment 322635 [details]
Calltrace systemtap to inspect auth_rpcgss and rpcsec_gss_krb5

Norman,

If you are comfortable using systemtap, can you please load the probes attached here and reproduce the bug? It will generate a call trace of any thread that calls into the auth_rpcgss module, and will let us know the call chains that lead to the problem. It is a bit of an extreme probe, but if we can get it to work it should really tell us what the processes are doing, and if there is a race somewhere we should see it.

Thanks!
Fabio Olive
Fabio,

Systemtap can only hang on for a short while when I'm pushing lots of I/O to NFS. After about 10-15 seconds, it dies with one of the following errors:

ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
WARNING: There were 81341 transport failures.

If you have any suggestions, I can certainly try anything. Sorry it wasn't more help!

Norman
You could take out the .return probes:

probe module("auth_rpcgss").function("*").call,
      module("rpcsec_gss_krb5").function("*").call
{
        printf("%s %s\n", probefunc(), $$parms)
}

You could run stap with "-DSTP_NO_OVERLOAD" and/or "-s 256".
Created attachment 322743 [details]
Last 10K lines of systemtap output

Per Frank's suggestions, I ran with STP_NO_OVERLOAD and have attached the last 10K lines from my systemtap output. I'm setting up a FC10 box now to see if I can reproduce the problem with a newer kernel.

Norman
I am pretty sure that the problem was fixed by kernel-2.6.20-1.2925.fc6. Since that upgrade I haven't seen the bug again up to F9. From what I remember there were some changes between 2.6.18 and 2.6.20 in the relevant code.
From our local reproducer, these are the contents of the struct gss_auth that __gss_find_upcall was iterating over:

crash> struct gss_auth 0xffffffff881c36a0
struct gss_auth {
  rpc_auth = {
    au_cslack = 0x1d,
    au_rslack = 0x43,
    au_verfsize = 0xc,
    au_flags = 0x0,
    au_ops = 0xffffffff881c3600,
    au_flavor = 0x1,
    au_count = {
      counter = 0x1
    },
    au_credcache = 0xffffffff881c36e0
  },
  mech = 0x0,
  service = 0,
  upcalls = {
    next = 0x0,
    prev = 0x0
  },
  client = 0x0,
  dentry = 0x0,
  path = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000>\234<FF><FF>\000\000\000",
  lock = {
    raw_lock = {
      slock = 0x3a97
    }
  }
}

It looks severely broken: the upcalls list is NULL and the lock value looks completely corrupt. Trying to find what could corrupt the struct gss_auth like that. It reproduced even on a uni-processor Xen machine.

Cheers,
Fabio Olive
Kostas,

Interesting... if you really haven't seen this issue since 2.6.20, it limits our search space a little.

$ git log --pretty=oneline v2.6.18..v2.6.20 net/sunrpc | wc -l
133

This one looks interesting:

bde8f00ce64d9824a4f227c8594e335a1a10d044 [PATCH] NFS: Fix Oops in rpc_call_sync()

It could result in one task remaining at work on an RPC task that was just released by another thread, causing it to dereference cleaned-up fields such as cl_auth->upcalls.
Actually, spoke a little too soon. 2.6.18-92.el5 has that.
I tried to reproduce this with F9 and this panic is not reproducible there. When tickets are expired, writing to the nfs mount simply gets stopped without any panic. Linux hostname 2.6.25-2.fc9.x86_64.xen #1 Fri Apr 25 13:13:13 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux So the fix is already upstream, I think.
Any update on this bug? If there's any chance of getting a test kernel or a hotfix, I'd be willing to try it out! Thanks Norman
Regarding comment #18, did you happen to have memory poisoning enabled when you tested that? It would be interesting to run the same reproducer on a kernel-debug, that might give a better indication of what's actually happening to the corrupt gss_auth struct.
FWIW, I tried reproducing this:

1) Set up a client to mount a krb5 mount
2) had an unprivileged user get a krb5 ticket with a 5m timeout
3) dd'ed 1GB of /dev/zero to a file on the mount as that user
4) deleted the file
5) slept 5s and goto 3

...the box did not crash. After the ticket expired, I got:

rm: cannot remove `/mnt/krb5/testuser/testfile': Permission denied
dd: opening `/mnt/krb5/testuser/testfile': Permission denied

...and messages like this in the ring buffer:

call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
call_verify: auth check failed
RPC call_verify: retry failed, exit EIO
call_verify: auth check failed
RPC call_verify: retry failed, exit EIO
call_verify: auth check failed
call_verify: auth check failed
RPC call_verify: retry failed, exit EIO
call_verify: auth check failed
RPC call_verify: retry failed, exit EIO

...this host was running kernel-debug-2.6.18-128.el5.jtltest.57.x86_64 from my people.redhat.com page. Is there something else I should be doing to reproduce this? If that sounds like how you're reproducing it, it might be interesting to try a 5.3-ish kernel and see if this is still reproducible there.
I don't see any evidence of poisoning patterns in the corrupt gss_auth struct. I think we need to step back and see whether some earlier pointers in the chain might be bad as well. The structs involved here are pretty convoluted, but the rpc_task struct looks fine:

crash> struct rpc_task.tk_magic ffff81001656fe20
  tk_magic = 0xf00baa,

The rpc_clnt attached to that struct also looks OK AFAICT.

...and the rpc_auth struct also looks valid:

crash> struct rpc_auth 0xffffffff88399520
struct rpc_auth {
  au_cslack = 0x1d,
  au_rslack = 0x3f,
  au_verfsize = 0xc,
  au_flags = 0x0,
  au_ops = 0xffffffff88399480,
  au_flavor = 0x1,
  au_count = {
    counter = 0x1
  },
  au_credcache = 0xffffffff88399560
}

...but the au_flavor is RPC_AUTH_UNIX. I think it should be RPC_AUTH_GSS (0x6) instead. The gss_auth struct we're working with is determined from the container of this struct:

    struct gss_auth *gss_auth = container_of(task->tk_client->cl_auth,
                                             struct gss_auth, rpc_auth);

...so it doesn't seem too surprising that it looks bogus once we start looking at it. It might be interesting to add some BUG_ON's to sanity check the au_flavor.
RPC credops attached to tk_msg are definitely AUTH_GSS ones:

crash> struct rpc_credops 0xffffffff88406b80
struct rpc_credops {
  cr_name = 0xffffffff88401e92 "AUTH_GSS",
  cr_init = 0xffffffff883fd759 <gss_cred_init>,
  crdestroy = 0xffffffff883fdc21 <gss_destroy_cred>,
  crmatch = 0xffffffff883fd000 <init_rpcsec_gss>,
  crmarshal = 0xffffffff883fdc5d <gss_marshal>,
  crrefresh = 0xffffffff883fd8ad <gss_refresh>,
  crvalidate = 0xffffffff883fddef <gss_validate>,
  crwrap_req = 0xffffffff883fdf47 <gss_wrap_req>,
  crunwrap_resp = 0xffffffff883fd284 <gss_unwrap_resp>
}

...so to summarize, we have:

    rpc_task.tk_msg.rpc_cred.cr_ops, which points to auth_gss ops

...and...

    rpc_task->tk_client->cl_auth, which points to an auth_unix rpc_auth struct

...I think this is the crux of the problem but am not yet sure how it ends up this way.
Jeff,

Earlier you mentioned this patch:

SUNRPC: Ensure that the struct gss_auth lifetime exceeds the credential's

Maybe the auth_gss structure is being cleaned up and reused as an auth_unix struct while rpc_task->tk_client still had a pointer to it, trusting the old values? I do not know if the auth_* structures are allocated out of a common cache or not, but it does look like use-after-free, especially when there is a probable refcount/lifetime issue in the code.

Cheers,
Fabio Olive
The cl_auth gets set in rpcauth_create. What seems to be happening is that we're creating 2 rpc_auth structs for 2 different tk_client's. When the delayed work to renew the cred is being done, it looks like it has the "wrong" tk_client.

Here's a bit of debugging from rpcauth_create that shows the pseudoflavor, rpc_clnt address and rpc_auth addr, plus a stack trace of how it got there. Given that this should be a krb5p mount, it's not clear to me why we're creating an rpc_clnt with a pseudoflavor of 1 here.

rpcauth_create: pseudoflavor=1 rpc_clnt=ffff8100115f1118 auth=ffffffff88394820

Call Trace:
 [<ffffffff88379ad2>] :sunrpc:rpcauth_create+0x8d/0x95
 [<ffffffff88374235>] :sunrpc:rpc_new_client+0x1d9/0x299
 [<ffffffff80159cc3>] __spin_lock_init+0x29/0x47
 [<ffffffff8837464d>] :sunrpc:rpc_create_client+0xb/0x3c
 [<ffffffff884c4ba0>] :nfs:nfs_create_rpc_client+0x11b/0x15c
 [<ffffffff884c5182>] :nfs:nfs4_set_client+0x9f/0x174
 [<ffffffff884c58d7>] :nfs:nfs4_create_server+0x7a/0x394
 [<ffffffff884cc778>] :nfs:nfs_copy_user_string+0x3c/0x89
 [<ffffffff80013216>] poison_obj+0x26/0x2f
 [<ffffffff884cc778>] :nfs:nfs_copy_user_string+0x3c/0x89
 [<ffffffff8000cc34>] cache_alloc_debugcheck_after+0xef/0x1c1
 [<ffffffff800e2443>] __kmalloc+0x11d/0x129
 [<ffffffff884cd0d4>] :nfs:nfs4_get_sb+0x1fc/0x323
 [<ffffffff800d33fe>] zone_statistics+0x3e/0x6d
 [<ffffffff8000ac30>] get_page_from_freelist+0x3ad/0x450
 [<ffffffff800eaeb2>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800eaf7b>] do_kern_mount+0x36/0x4d
 [<ffffffff800f5311>] do_mount+0x6af/0x71f
 [<ffffffff800f3ed3>] copy_mount_options+0xce/0x127
 [<ffffffff800a0cbb>] search_exception_tables+0x1d/0x2d
 [<ffffffff8006acce>] do_page_fault+0x650/0x835
 [<ffffffff800d33fe>] zone_statistics+0x3e/0x6d
 [<ffffffff8000aab5>] get_page_from_freelist+0x232/0x450
 [<ffffffff8000f9c1>] __alloc_pages+0x76/0x2df
 [<ffffffff8004e5b9>] sys_mount+0x8a/0xcd
 [<ffffffff800602a6>] tracesys+0xd5/0xdf

rpcauth_create: pseudoflavor=390005 rpc_clnt=ffff8100115f1330 auth=ffff810017df0b60
Call Trace:
 [<ffffffff88379ad2>] :sunrpc:rpcauth_create+0x8d/0x95
 [<ffffffff884c4332>] :nfs:nfs_init_server_rpcclient+0x5e/0xc0
 [<ffffffff884c5a1e>] :nfs:nfs4_create_server+0x1c1/0x394
 [<ffffffff884cc778>] :nfs:nfs_copy_user_string+0x3c/0x89
 [<ffffffff80013216>] poison_obj+0x26/0x2f
 [<ffffffff884cc778>] :nfs:nfs_copy_user_string+0x3c/0x89
 [<ffffffff8000cc34>] cache_alloc_debugcheck_after+0xef/0x1c1
 [<ffffffff800e2443>] __kmalloc+0x11d/0x129
 [<ffffffff884cd0d4>] :nfs:nfs4_get_sb+0x1fc/0x323
 [<ffffffff800d33fe>] zone_statistics+0x3e/0x6d
 [<ffffffff8000ac30>] get_page_from_freelist+0x3ad/0x450
 [<ffffffff800eaeb2>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800eaf7b>] do_kern_mount+0x36/0x4d
 [<ffffffff800f5311>] do_mount+0x6af/0x71f
 [<ffffffff800f3ed3>] copy_mount_options+0xce/0x127
 [<ffffffff800a0cbb>] search_exception_tables+0x1d/0x2d
 [<ffffffff8006acce>] do_page_fault+0x650/0x835
 [<ffffffff800d33fe>] zone_statistics+0x3e/0x6d
 [<ffffffff8000aab5>] get_page_from_freelist+0x232/0x450
 [<ffffffff8000f9c1>] __alloc_pages+0x76/0x2df
 [<ffffffff8004e5b9>] sys_mount+0x8a/0xcd
 [<ffffffff800602a6>] tracesys+0xd5/0xdf
I don't think that's the case. I instrumented rpcauth_create and see these calls:

rpcauth_create: pseudoflavor=1 rpc_clnt=ffff8100115f1978 auth=ffffffff88394820
rpcauth_create: pseudoflavor=1 rpc_clnt=ffff8100115f1118 auth=ffffffff88394820
rpcauth_create: pseudoflavor=1 rpc_clnt=ffff8100115f1118 auth=ffffffff88394820
rpcauth_create: pseudoflavor=390005 rpc_clnt=ffff8100115f1330 auth=ffff810017df0b60

...the problem with that theory is that I'm not seeing that auth struct (pseudoflavor 390005) get reallocated as an AUTH_UNIX struct. I don't think this is a use-after-free problem.
To summarize the reproducer:

1) mount up an export on a RHEL5 client using sec=krb5p (regular krb5 or krb5i may also work, I'm not sure)
2) become an unprivileged user that has write permissions on the mount
3) get a krb5 ticket as that user with a 10m timeout:

$ kinit -l 10m

4) run this loop as the unprivileged user (substitute paths where appropriate):

while true; do
    dd if=/dev/zero of=/mnt/krb5/testuser/testfile bs=1MB count=512
    rm -f /mnt/krb5/testuser/testfile
    sleep 5
done

...eventually the krb5 ticket will time out and a little while later the box will crash. I've tried using shorter krb5 ticket timeouts but it didn't seem to reproduce, so there is some timing involved.
A little more info. I did a bit more instrumentation and got a new core. This core panicked with rpc_task == 0xffff8100131c7060. That struct showed the now-familiar mismatch of AUTH_GSS cred ops with an AUTH_UNIX client. Some debug messages:

rpcauth_bindcred: task=ffff8100131c7060 au_name=UNIX cr_name=AUTH_UNIX tk_client=ffff81000a7b22b0
rpcauth_bindcred: task=ffff8100131c7060 au_name=UNIX cr_name=AUTH_UNIX tk_client=ffff81000a7b2098
rpcauth_bindcred: task=ffff8100131c7060 au_name=RPCSEC_GSS cr_name=AUTH_GSS tk_client=ffff81000bb5dd68

I think this means that the rpc_task was recycled a couple of times. When rpcauth_bindcred was called, tk_client was at 0xffff81000bb5dd68, but in the core, it's set to 0xffff81000a7b2098. It looks like the tk_client gets set in rpc_init_task, so I'm going to instrument that next...
This problem is also reproducible with sec=krb5.
The interesting bit is that the stack trace is always in the state renewal code:

Process events/0 (pid: 8, threadinfo ffff81001fccc000, task ffff81001f7fe240)
Stack:  ffff81001fba12f0 ffffffff883fb6f0 ffff81001a3b6858 ffff81001a3b6858
        ffff81001367e060 ffff81001a3b6858 ffffffff88397520 ffffffff883fb927
        ffff81001367e060 ffff81001a3b6858 0000000000000000 ffffffff884ed000
Call Trace:
 [<ffffffff883fb6f0>] :auth_rpcgss:gss_setup_upcall+0x93/0xfc
 [<ffffffff883fb927>] :auth_rpcgss:gss_refresh+0x7a/0x139
 [<ffffffff883794eb>] :sunrpc:rpcauth_refreshcred+0x48/0x53
 [<ffffffff88378d82>] :sunrpc:__rpc_execute+0x92/0x250
 [<ffffffff88373bb0>] :sunrpc:rpc_call_async+0x6f/0xa3
 [<ffffffff884e02e0>] :nfs:nfs4_proc_async_renew+0x45/0x4a
 [<ffffffff884e8728>] :nfs:nfs4_renew_state+0xda/0x1c2
 [<ffffffff8004f86f>] run_workqueue+0x9a/0xf4
 [<ffffffff8004c087>] worker_thread+0x0/0x122
 [<ffffffff8004c177>] worker_thread+0xf0/0x122
 [<ffffffff8008ebaf>] default_wake_function+0x0/0xe
 [<ffffffff800a7008>] trace_hardirqs_on+0x11b/0x13f
 [<ffffffff80034306>] kthread+0xfe/0x132
 [<ffffffff8006808c>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff8006898d>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff80034208>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11

nfs4_renew_state calls this:

    cred = nfs4_get_renew_cred(clp);

...and then this:

    nfs4_proc_async_renew(clp, cred);

...nfs4_proc_async_renew then sets up an rpc call with the cred that it found and the rpc_clnt attached to the nfs_client. The mismatch comes in at this point, I think. It seems like nfs4_get_renew_cred is returning GSSAPI credentials even though the rpc_clnt has AUTH_UNIX ops.
Created attachment 328546 [details]
patch -- create rpc clients with proper auth flavor

This patch seems to fix the problem and loosely matches how this code was changed upstream. It seems to work and I haven't seen any regressions yet, but it needs more testing. I'm not terribly familiar with the intricacies of the RPC auth code. Comments appreciated...
If you have a test kernel, I'd be happy to try it out on our end. Thanks for your help!
The latest RHEL5 test kernels on my people.redhat.com page should have this patch: http://people.redhat.com/jlayton If you're able to test them someplace non-critical, then please report back whether they fix the problem...
Good news! It works like a champ. Thanks for your help on this.
Thanks for testing it. I still want to do a bit more testing, but I'm 90% certain that it's good at this point.
in kernel-2.6.18-129.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
Updating PM score.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html