Description of problem:
When a clustered NFS resource was relocated to another node, the node it was on panicked.

Version-Release number of selected component (if applicable):
GFS-6.1.6-1
GFS-kernel-hugemem-2.6.9-55.0
kernel-hugemem-2.6.9-37.EL
nfs-utils-1.0.6-70.EL4

How reproducible:
Unknown

Steps to Reproduce:
1. Create an NFS server on top of a GFS file system with rgmanager
2. Start a load on the service
3. Relocate the service

Actual results:
Unable to handle kernel NULL pointer dereference at virtual address 00000098
 printing eip:
82bf8c82
*pde = 00004001
Oops: 0000 [#1]
SMP
Modules linked in: nfsd exportfs lockd nfs_acl lock_dlm(U) gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) parport_pc lp parport autofs4 i2c_dev i2c_core md5 ipv6 sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<82bf8c82>]    Not tainted VLI
EFLAGS: 00010202   (2.6.9-37.ELhugemem)
EIP is at gfs_fsync+0xc/0x9a [gfs]
eax: 00000000   ebx: 09dff360   ecx: 00000000   edx: 09dff360
esi: 82bf8c76   edi: 04742344   ebp: 00000000   esp: 215cce8c
ds: 007b   es: 007b   ss: 0068
Process nfsd (pid: 8016, threadinfo=215cc000 task=7e7c4230)
Stack: 00000000 00000000 00000001 00000000 00000000 00000000 00000000 00000000
       00000000 00000000 00000000 215cceb4 00000000 00000000 00000001 00000000
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000003
Call Trace:
 [<82bf8c76>] gfs_fsync+0x0/0x9a [gfs]
 [<82c55aab>] nfsd_sync_dir+0x2c/0x3a [nfsd]
 [<82c563f5>] nfsd_create+0x2f2/0x35c [nfsd]
 [<82c5d75f>] nfsd3_proc_mkdir+0xbe/0xc8 [nfsd]
 [<82c5fcbd>] nfs3svc_decode_mkdirargs+0x0/0x41e [nfsd]
 [<82c5fcbd>] nfs3svc_decode_mkdirargs+0x0/0x41e [nfsd]
 [<82c52681>] nfsd_dispatch+0xba/0x16d [nfsd]
 [<82ae0603>] svc_process+0x432/0x6d7 [sunrpc]
 [<82c5245a>] nfsd+0x1cc/0x339 [nfsd]
 [<82c5228e>] nfsd+0x0/0x339 [nfsd]
 [<021041f5>] kernel_thread_helper+0x5/0xb
Code: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0[expected: 0], irqs_disabled():1
 [<02120209>] __might_sleep+0x7d/0x88
 [<02155350>] rw_vm+0xe4/0x29c
 [<82bf8c57>] gfs_close+0x33/0x52 [gfs]
 [<82bf8c57>] gfs_close+0x33/0x52 [gfs]
 [<021557c7>] get_user_size+0x30/0x57
 [<82bf8c57>] gfs_close+0x33/0x52 [gfs]
 [<021061a7>] show_registers+0x115/0x16c
 [<0210633e>] die+0xdb/0x16b
 [<02122a14>] vprintk+0x136/0x14a
 [<0211b236>] do_page_fault+0x421/0x5f7
 [<82bf8c82>] gfs_fsync+0xc/0x9a [gfs]
 [<022ca7e1>] __cond_resched+0x14/0x39
 [<022ca259>] wait_for_completion+0xc3/0xcb
 [<0211e9b8>] complete+0x2b/0x3d
 [<82be30c5>] lock_on_glock+0x64/0x6a [gfs]
 [<022ca7e1>] __cond_resched+0x14/0x39
 [<0216f90e>] alloc_inode+0xf6/0x179
 [<0211ae15>] do_page_fault+0x0/0x5f7
 [<82bf8c76>] gfs_fsync+0x0/0x9a [gfs]
 [<82bf8c82>] gfs_fsync+0xc/0x9a [gfs]
 [<82bf8c76>] gfs_fsync+0x0/0x9a [gfs]
 [<82c55aab>] nfsd_sync_dir+0x2c/0x3a [nfsd]
 [<82c563f5>] nfsd_create+0x2f2/0x35c [nfsd]
 [<82c5d75f>] nfsd3_proc_mkdir+0xbe/0xc8 [nfsd]
 [<82c5fcbd>] nfs3svc_decode_mkdirargs+0x0/0x41e [nfsd]
 [<82c5fcbd>] nfs3svc_decode_mkdirargs+0x0/0x41e [nfsd]
 [<82c52681>] nfsd_dispatch+0xba/0x16d [nfsd]
 [<82ae0603>] svc_process+0x432/0x6d7 [sunrpc]
 [<82c5245a>] nfsd+0x1cc/0x339 [nfsd]
 [<82c5228e>] nfsd+0x0/0x339 [nfsd]
 [<021041f5>] kernel_thread_helper+0x5/0xb
 Bad EIP value.
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception

Expected results:
The service should relocate to the other node and nothing should panic.

Additional info:
I just hit this same issue on a 4-node ia64 cluster. No service relocation was taking place, just I/O to the NFS service.

1) Create service (GFS fs exported via NFS)
2) Mount on client node
3) Start I/O load

Unable to handle kernel NULL pointer dereference (address 00000000000000f0)
nfsd[13682]: Oops 11012296146944 [1]
Modules linked in: nfsd exportfs lockd nfs_acl lock_dlm(U) gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat button ohci_hcd ehci_hcd e100 mii tg3 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod lpfc scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod

Pid: 13682, CPU 0, comm: nfsd
psr : 0000121008126010 ifs : 800000000000038b ip  : [<a00000020077eba1>]    Not tainted
ip is at gfs_fsync+0x21/0x200 [gfs]
unat: 0000000000000000 pfs : 0000000000000309 rsc : 0000000000000003
rnat: e00000000b24fc00 bsps: e00000000b24fc00 pr  : 0000000000009a81
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000002008a9470 b6  : a00000020077eb80 b7  : a000000100203420
f6  : 1003e6000000d0f655180 f7  : 1003ee00000003f153b80
f8  : 1003e0000000000000035 f9  : 10002f000000000000000
f10 : 0ffffaaaaaaaaa9574a00 f11 : 1003e0000000000000001
r1  : a000000200918000 r2  : e00000000b24fd98 r3  : a00000020077eb80
r8  : 0000000000000000 r9  : e000000006079988 r10 : 0000000000000000
r11 : e00000000b24fde0 r12 : e00000000b24fd50 r13 : e00000000b248000
r14 : a0000002007c6540 r15 : e000000006079af0 r16 : e000000039374950
r17 : 0000000000000000 r18 : e00000000b24fdd8 r19 : e00000000b24fde8
r20 : e00000000b24fdf0 r21 : e00000000b24fdf8 r22 : e00000000b24fde0
r23 : 0000000000000000 r24 : e00000000b24fdd0 r25 : 0000000000000001
r26 : e00000000b24fdc8 r27 : 0000000000000000 r28 : e00000000b24fdc0
r29 : 0000000000000000 r30 : e000000039374940 r31 : 0000000000000000

Call Trace:
 [<a000000100016da0>] show_stack+0x80/0xa0
                                sp=e00000000b24f8e0 bsp=e00000000b2491f0
 [<a0000001000176b0>] show_regs+0x890/0x8c0
                                sp=e00000000b24fab0 bsp=e00000000b2491a8
 [<a00000010003e8f0>] die+0x150/0x240
                                sp=e00000000b24fad0 bsp=e00000000b249168
 [<a000000100064440>] ia64_do_page_fault+0x8c0/0xbc0
                                sp=e00000000b24fad0 bsp=e00000000b249100
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
                                sp=e00000000b24fb80 bsp=e00000000b249100
 [<a00000020077eba0>] gfs_fsync+0x20/0x200 [gfs]
                                sp=e00000000b24fd50 bsp=e00000000b2490a8
 [<a0000002008a9470>] nfsd_sync_dir+0xb0/0x100 [nfsd]
                                sp=e00000000b24fdf0 bsp=e00000000b249078
 [<a0000002008b0f00>] nfsd_create+0x700/0x940 [nfsd]
                                sp=e00000000b24fdf0 bsp=e00000000b248fe8
 [<a0000002008c4b40>] nfsd3_proc_mkdir+0x1a0/0x220 [nfsd]
                                sp=e00000000b24fdf0 bsp=e00000000b248f90
 [<a00000020089f820>] nfsd_dispatch+0x340/0x600 [nfsd]
                                sp=e00000000b24fdf0 bsp=e00000000b248f38
 [<a0000002004654d0>] svc_process+0x1630/0x1880 [sunrpc]
                                sp=e00000000b24fdf0 bsp=e00000000b248ec0
 [<a00000020089efb0>] nfsd+0x490/0x9c0 [nfsd]
                                sp=e00000000b24fe00 bsp=e00000000b248e38
 [<a000000100018c70>] kernel_thread_helper+0x30/0x60
                                sp=e00000000b24fe30 bsp=e00000000b248e10
 [<a000000100008c60>] start_kernel_thread+0x20/0x40
                                sp=e00000000b24fe30 bsp=e00000000b248e10
Kernel panic - not syncing: Fatal exception
After further investigation, it appears that the panic is caused by a mkdir in our tools on an NFS client when trying to start the NFS load. Relocation had not been attempted when the panic happened.
This is what happens...

RHEL4 defaults all exports to the "sync" option, so nfsd_create() goes to:

        if (EX_ISSYNC(fhp->fh_export)) {
                nfsd_sync_dir(dentry);
                write_inode_now(dchild->d_inode, 1);
        }

And nfsd_sync_dir() calls nfsd_dosync() with filp set to NULL:

        void
        nfsd_sync_dir(struct dentry *dp)
        {
                nfsd_dosync(NULL, dp, dp->d_inode->i_fop);
        }

so nfsd_dosync() passes gfs_fsync() a NULL filp:

        inline void nfsd_dosync(struct file *filp, struct dentry *dp,
                                struct file_operations *fop)
        {
                struct inode *inode = dp->d_inode;
                int (*fsync) (struct file *, struct dentry *, int);

                filemap_fdatawrite(inode->i_mapping);
                if (fop && (fsync = fop->fsync))
                        fsync(filp, dp, 0);
                filemap_fdatawait(inode->i_mapping);
        }

And I used filp to get the mapping pointer... OK, a fix is on the way.
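
For anyone following along, here is a rough user-space sketch of the failure pattern described above. It is not the GFS code and not the actual fix (the patch itself is not shown in this report). The struct definitions are stripped-down stand-ins for the kernel types, and the names fsync_via_file(), fsync_via_dentry() and sync_dir() are hypothetical. The sketch assumes the safe approach is to reach the inode through the dentry, which nfsd always supplies, rather than through the file pointer, which may be NULL.

```c
#include <stddef.h>

/* Hypothetical, minimal stand-ins for the kernel structures. */
struct inode  { int synced; };
struct dentry { struct inode *d_inode; };
struct file   { struct dentry *f_dentry; };

typedef int (*fsync_fn)(struct file *, struct dentry *, int);

/*
 * Buggy pattern: reach the inode through filp. In the kernel the
 * dereference of a NULL filp is what oopses; here we return -1
 * instead so the sketch can run safely in user space.
 */
static int fsync_via_file(struct file *filp, struct dentry *dentry, int datasync)
{
    (void)dentry;
    (void)datasync;
    if (filp == NULL)
        return -1;
    filp->f_dentry->d_inode->synced = 1;
    return 0;
}

/*
 * Assumed safer pattern: ignore filp and use the dentry, which the
 * caller passes even when there is no open file.
 */
static int fsync_via_dentry(struct file *filp, struct dentry *dentry, int datasync)
{
    (void)filp;
    (void)datasync;
    dentry->d_inode->synced = 1;
    return 0;
}

/* Mirrors nfsd_sync_dir(): there is no open file, so filp is NULL. */
static int sync_dir(struct dentry *dp, fsync_fn fsync)
{
    return fsync(NULL, dp, 0);
}
```

Calling sync_dir() with fsync_via_file fails because there is no file to look through, while fsync_via_dentry marks the inode as synced. Any fsync implementation that can be reached from nfsd on a "sync" export has to tolerate a NULL file pointer in the same way.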
Code checked into CVS. Please re-try.
This bug isn't interfering with our testing anymore.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0561.html