Easily reproducible kernel panic -- mount an nfsv4 filesystem, and then
attempt to mount it again. For instance, run this twice:

# mount -t nfs4 server:/ /mnt/server

Oops looks like this:

Unable to handle kernel paging request at ffffffffffffffff RIP:
 <ffffffff8015bd5a>{free_percpu+24}
PML4 103067 PGD 1727067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: nfs lockd nfs_acl md5 ipv6 autofs4 rpcsec_gss_krb5 auth_rpcgss des sunrpc loop xennet dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod scsi_mod
Pid: 2133, comm: mount Not tainted 2.6.9-48.EL.mntcrash.1xenU
RIP: e030:[<ffffffff8015bd5a>] <ffffffff8015bd5a>{free_percpu+24}
RSP: e02b:ffffff801b7d5c18  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffffffffffffffff RCX: ffffff801fe2be00
RDX: ffffff8001000000 RSI: 0000000000000042 RDI: 0000000000000000
RBP: 0000000000000000 R08: 00000000c43910ac R09: ffffff801fe2be00
R10: ffffff801fe2be00 R11: ffffff801fe2be00 R12: 0000000000000000
R13: ffffff801b037000 R14: ffffffffa019a1e0 R15: ffffff801b029000
FS:  0000002a95573b00(0000) GS:ffffffff8041d700(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process mount (pid: 2133, threadinfo ffffff801b7d4000, task ffffff801bdab030)
Stack: ffffffffff5fd000 ffffff801fe2be00 ffffffffa019a1e0 ffffffffa016561c
       ffffff801fe03400 ffffffff801ce4de 0000000000000000 0000000000000000
       0000000000000000 0000000000000000
Call Trace:
 <ffffffffa016561c>{:nfs:nfs4_get_sb+1759} <ffffffff801ce4de>{selinux_sb_copy_data+47}
 <ffffffff8017b02d>{do_kern_mount+161} <ffffffff80190d33>{do_mount+1690}
 <ffffffff802360c9>{sock_common_recvmsg+48} <ffffffff80232d8a>{sock_aio_read+297}
 <ffffffff80265895>{tcp_transmit_skb+2037} <ffffffff80235e1f>{sk_reset_timer+15}
 <ffffffff802663b0>{tcp_write_xmit+314} <ffffffff80156ff4>{buffered_rmqueue+384}
 <ffffffff8010ddc3>{error_exit+0} <ffffffff801571e4>{__alloc_pages+200}
 <ffffffff801910d6>{sys_mount+186} <ffffffff8010d66e>{system_call+134}
 <ffffffff8010d5e8>{system_call+0}

Code: 48 8b 3b e8 9d f4 ff ff ff c5 48 83 c3 08 83 fd 1f 7e e0 58
RIP <ffffffff8015bd5a>{free_percpu+24} RSP <ffffff801b7d5c18>
CR2: ffffffffffffffff
 <0>Kernel panic - not syncing: Oops

Reproduced so far on an x86_64 xen guest running a -48.EL kernel with the
patch for bz 226983. Not certain yet if other arches are affected.
To clarify, I've also seen the same panic on a stock -48 xenU kernel. I just tried the patch in 226983 to see if it might fix this as well, but it didn't.
Same panic on i686 xen guest as well:

general protection fault: 0000 [#1]
SMP
Modules linked in: nfs lockd nfs_acl md5 ipv6 autofs4 sunrpc dm_mirror dm_mod xennet ext3 jbd xenblk sd_mod scsi_mod
CPU:    0
EIP:    0061:[<c01424a3>]    Not tainted VLI
EFLAGS: 00010286   (2.6.9-48.ELxenU)
EIP is at free_percpu+0x17/0x29
eax: ffffffff   ebx: 00000000   ecx: df357000   edx: f5392000
esi: ffffffff   edi: c1627200   ebp: dcf621a0   esp: df357e98
ds: 007b   es: 007b   ss: 0068
Process mount (pid: 2556, threadinfo=df357000 task=de933970)
Stack: c16244f8 c1624400 e1231c45 00000000 dceef000 00000000 c1630980 e125f7c0
       c015eb35 e125f7c0 00000000 dceef000 dcf16000 dded5000 dd2bc000 dcf16000
       00000015 dceef000 c0172c9d dd2bc000 00000000 dceef000 dcf16000 00000000
Call Trace:
 [<e1231c45>] nfs4_get_sb+0x265/0x275 [nfs]
 [<c015eb35>] do_kern_mount+0x85/0x143
 [<c0172c9d>] do_new_mount+0x67/0xa4
 [<c01732ea>] do_mount+0x15f/0x179
 [<c0107507>] error_code+0x2b/0x30
 [<c026a298>] iret_exc+0xeb4/0x159c
 [<c0173140>] copy_mount_options+0x49/0x94
 [<c0173655>] sys_mount+0x9b/0x115
 [<c010737f>] syscall_call+0x7/0xb
Code: 0e 80 3a 00 74 09 5b 5e 5f 5d e9 b1 4b 0b 00 5b 5e 5f 5d c3 56 53 8b 74 24 0c 31 db f7 d6 0f a3 1d 24 59 3a c0 19 c0 85 c0 74 09 <ff> 34 9e e8 55 ff ff ff 59 43 83 fb 1f 7e e4 5b 5e c3 8b 44 24
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception

It looks like -42.26 does not panic on i686, so my guess is that this is a
regression introduced somewhere between -42.26 and -48. I'll see if I can
confirm when it was introduced.
Looks like this was introduced in -42.27. The most likely culprit is the
nfs-stats patch detailed here:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=199263

I'm going to try backing this out and seeing if it fixes the problem.
Created attachment 148684 [details]
patch to check for NULL pointer before freeing

This patch corrected the oops. nfs4_get_sb needs to check if
server->io_stats is NULL before trying to free it.
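For context, the oops is free_percpu() being handed an io_stats pointer that
was never successfully allocated on the failure path of the second mount, and
free_percpu() on this kernel doesn't tolerate that. The guard is basically a
one-liner; this is only an illustrative sketch of the shape of the fix, not
the literal contents of attachment 148684:

	/* illustrative only -- error path in nfs4_get_sb() */
	if (server->io_stats != NULL)
		free_percpu(server->io_stats);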
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Created attachment 148816 [details]
backported upstream patch for same problem

Actually, this patch, backported from here, might be better:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=01d0ae8beaee75d954900109619b700fe68707d9

This looks like it fixes the original problem and also addresses Peter's
concerns. In addition to what was in the upstream patch, I also added a call
to nfs_free_iostats in the error path of nfs_sb_init. It looked like if
getting a root inode or dentry failed, the iostats would leak.
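The core of the upstream change, as I read it, is to free the per-cpu stats
through a NULL-safe helper rather than calling free_percpu() directly.
Roughly, the helper in fs/nfs/iostat.h looks like this (paraphrased from
upstream, not the backport itself):

	static inline void nfs_free_iostats(struct nfs_iostats *stats)
	{
		if (stats != NULL)
			free_percpu(stats);
	}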
Created attachment 148838 [details]
updated patch -- remove added nfs_free_iostats that would have caused double-free

Peter pointed out that the nfs_free_iostats call I added could cause a
double free, since kill_sb gets called in that error condition anyway. This
patch gets rid of it and should be pretty much the same as the upstream
patch.
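For what it's worth, the other common way to make a free that can run in two
cleanup paths safe is to clear the pointer after freeing it, so the later
kill_sb-side free becomes a no-op given a NULL-safe nfs_free_iostats.
Something like this -- hypothetical, not what the attachment does (it simply
drops the extra call):

	/* hypothetical alternative, not attachment 148838 */
	nfs_free_iostats(server->io_stats);
	server->io_stats = NULL;	/* later free in kill_sb is now a no-op */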
Created attachment 149372 [details]
patch -- don't free NULL pointer on error, also don't leak iostats

This patch should fix the problem as well, and doesn't pull in the changes
to nfs_sb_init.
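So the net effect on the nfs4_get_sb error path is two-fold: only free
io_stats when it was actually allocated, and make sure stats that were
allocated do get freed before the failed mount attempt is torn down, so they
don't leak. A rough sketch of that cleanup -- labels and surrounding code are
hypothetical, this is not the attachment:

	out_free:
		/* hypothetical: guard against a never-allocated io_stats,
		 * and free any that were allocated so they don't leak */
		if (server->io_stats != NULL) {
			free_percpu(server->io_stats);
			server->io_stats = NULL;
		}
		kfree(server);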
Committed in stream U5 build 50. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Patch is in -52, already verified by at least one partner.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html