Bug 193817

Summary:	panic on mkdir on NFS on GFS
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Nate Straz <nstraz>
Component:	gfs	Assignee:	Wendy Cheng <nobody+wcheng>
Status:	CLOSED ERRATA	QA Contact:	GFS Bugs <gfs-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHBA-2006-0561	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-08-10 21:35:48 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	180185

Description Nate Straz 2006-06-01 20:24:11 UTC

Description of problem:

When a clustered NFS resource was relocated to another node, the node
it was on panicked.

Version-Release number of selected component (if applicable):
GFS-6.1.6-1
GFS-kernel-hugemem-2.6.9-55.0
kernel-hugemem-2.6.9-37.EL
nfs-utils-1.0.6-70.EL4


How reproducible:
Unknown

Steps to Reproduce:
1. Create an NFS server on top of a GFS file system with rgmanager
2. Start a load on the service
3. Relocate the service
  
Actual results:
Unable to handle kernel NULL pointer dereference at virtual address 00000098
 printing eip:                                                              
82bf8c82      
*pde = 00004001
Oops: 0000 [#1]
SMP            
Modules linked in: nfsd exportfs lockd nfs_acl lock_dlm(U) gnbd(U)
lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) parport_pc lp parport
autofs4 i2c_dev i2c_core md5 ipv6 sunrpc button battery ac uhci_hcd hw_random
e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<82bf8c82>]    Not tainted VLI
EFLAGS: 00010202   (2.6.9-37.ELhugemem)     
EIP is at gfs_fsync+0xc/0x9a [gfs]      
eax: 00000000   ebx: 09dff360   ecx: 00000000   edx: 09dff360
esi: 82bf8c76   edi: 04742344   ebp: 00000000   esp: 215cce8c
ds: 007b   es: 007b   ss: 0068                               
Process nfsd (pid: 8016, threadinfo=215cc000 task=7e7c4230)
Stack: 00000000 00000000 00000001 00000000 00000000 00000000 00000000 00000000 
       00000000 00000000 00000000 215cceb4 00000000 00000000 00000001 00000000 
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000003 
Call Trace:                                                                    
 [<82bf8c76>] gfs_fsync+0x0/0x9a [gfs]
 [<82c55aab>] nfsd_sync_dir+0x2c/0x3a [nfsd]
 [<82c563f5>] nfsd_create+0x2f2/0x35c [nfsd]
 [<82c5d75f>] nfsd3_proc_mkdir+0xbe/0xc8 [nfsd]
 [<82c5fcbd>] nfs3svc_decode_mkdirargs+0x0/0x41e [nfsd]
 [<82c5fcbd>] nfs3svc_decode_mkdirargs+0x0/0x41e [nfsd]
 [<82c52681>] nfsd_dispatch+0xba/0x16d [nfsd]          
 [<82ae0603>] svc_process+0x432/0x6d7 [sunrpc]
 [<82c5245a>] nfsd+0x1cc/0x339 [nfsd]         
 [<82c5228e>] nfsd+0x0/0x339 [nfsd]  
 [<021041f5>] kernel_thread_helper+0x5/0xb
Code: <3>Debug: sleeping function called from invalid context at
include/linux/rwsem.h:43
in_atomic():0[expected: 0], irqs_disabled():1
 [<02120209>] __might_sleep+0x7d/0x88        
 [<02155350>] rw_vm+0xe4/0x29c       
 [<82bf8c57>] gfs_close+0x33/0x52 [gfs]
 [<82bf8c57>] gfs_close+0x33/0x52 [gfs]
 [<021557c7>] get_user_size+0x30/0x57  
 [<82bf8c57>] gfs_close+0x33/0x52 [gfs]
 [<021061a7>] show_registers+0x115/0x16c
 [<0210633e>] die+0xdb/0x16b            
 [<02122a14>] vprintk+0x136/0x14a
 [<0211b236>] do_page_fault+0x421/0x5f7
 [<82bf8c82>] gfs_fsync+0xc/0x9a [gfs] 
 [<022ca7e1>] __cond_resched+0x14/0x39
 [<022ca259>] wait_for_completion+0xc3/0xcb
 [<0211e9b8>] complete+0x2b/0x3d           
 [<82be30c5>] lock_on_glock+0x64/0x6a [gfs]
 [<022ca7e1>] __cond_resched+0x14/0x39     
 [<0216f90e>] alloc_inode+0xf6/0x179  
 [<0211ae15>] do_page_fault+0x0/0x5f7
 [<82bf8c76>] gfs_fsync+0x0/0x9a [gfs]
 [<82bf8c82>] gfs_fsync+0xc/0x9a [gfs]
 [<82bf8c76>] gfs_fsync+0x0/0x9a [gfs]
 [<82c55aab>] nfsd_sync_dir+0x2c/0x3a [nfsd]
 [<82c563f5>] nfsd_create+0x2f2/0x35c [nfsd]
 [<82c5d75f>] nfsd3_proc_mkdir+0xbe/0xc8 [nfsd]
 [<82c5fcbd>] nfs3svc_decode_mkdirargs+0x0/0x41e [nfsd]
 [<82c5fcbd>] nfs3svc_decode_mkdirargs+0x0/0x41e [nfsd]
 [<82c52681>] nfsd_dispatch+0xba/0x16d [nfsd]          
 [<82ae0603>] svc_process+0x432/0x6d7 [sunrpc]
 [<82c5245a>] nfsd+0x1cc/0x339 [nfsd]         
 [<82c5228e>] nfsd+0x0/0x339 [nfsd]  
 [<021041f5>] kernel_thread_helper+0x5/0xb
 Bad EIP value.                           
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception



Expected results:
service should relocate to the other node and nothing should panic.

Additional info:

Comment 1 Dean Jansa 2006-06-01 20:47:07 UTC

I just hit this same issue, on an ia64 4 node cluster.  No service relocation
was taking place, just IO to NFS service.

1) Create service  (GFS fs exported via NFS)
2) mount on client node
3) Start IO Load


Unable to handle kernel NULL pointer dereference (address 00000000000000f0)
nfsd[13682]: Oops 11012296146944 [1]                                       
Modules linked in: nfsd exportfs lockd nfs_acl lock_dlm(U) gnbd(U)
lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp
parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat button ohci_hcd
ehci_hcd e100 mii tg3 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod lpfc
scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
                                                  
Pid: 13682, CPU 0, comm:                 nfsd
psr : 0000121008126010 ifs : 800000000000038b ip  : [<a00000020077eba1>]    Not
tainted
ip is at gfs_fsync+0x21/0x200 [gfs]
unat: 0000000000000000 pfs : 0000000000000309 rsc : 0000000000000003
rnat: e00000000b24fc00 bsps: e00000000b24fc00 pr  : 0000000000009a81
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000                       
b0  : a0000002008a9470 b6  : a00000020077eb80 b7  : a000000100203420
f6  : 1003e6000000d0f655180 f7  : 1003ee00000003f153b80             
f8  : 1003e0000000000000035 f9  : 10002f000000000000000
f10 : 0ffffaaaaaaaaa9574a00 f11 : 1003e0000000000000001
r1  : a000000200918000 r2  : e00000000b24fd98 r3  : a00000020077eb80
r8  : 0000000000000000 r9  : e000000006079988 r10 : 0000000000000000
r11 : e00000000b24fde0 r12 : e00000000b24fd50 r13 : e00000000b248000
r14 : a0000002007c6540 r15 : e000000006079af0 r16 : e000000039374950
r17 : 0000000000000000 r18 : e00000000b24fdd8 r19 : e00000000b24fde8
r20 : e00000000b24fdf0 r21 : e00000000b24fdf8 r22 : e00000000b24fde0
r23 : 0000000000000000 r24 : e00000000b24fdd0 r25 : 0000000000000001
r26 : e00000000b24fdc8 r27 : 0000000000000000 r28 : e00000000b24fdc0
r29 : 0000000000000000 r30 : e000000039374940 r31 : 0000000000000000
                                                                    
Call Trace:
 [<a000000100016da0>] show_stack+0x80/0xa0
                                sp=e00000000b24f8e0 bsp=e00000000b2491f0
 [<a0000001000176b0>] show_regs+0x890/0x8c0                             
                                sp=e00000000b24fab0 bsp=e00000000b2491a8
 [<a00000010003e8f0>] die+0x150/0x240                                   
                                sp=e00000000b24fad0 bsp=e00000000b249168
 [<a000000100064440>] ia64_do_page_fault+0x8c0/0xbc0                    
                                sp=e00000000b24fad0 bsp=e00000000b249100
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260                       
                                sp=e00000000b24fb80 bsp=e00000000b249100
 [<a00000020077eba0>] gfs_fsync+0x20/0x200 [gfs]                        
                                sp=e00000000b24fd50 bsp=e00000000b2490a8
 [<a0000002008a9470>] nfsd_sync_dir+0xb0/0x100 [nfsd]                   
                                sp=e00000000b24fdf0 bsp=e00000000b249078
 [<a0000002008b0f00>] nfsd_create+0x700/0x940 [nfsd]                    
                                sp=e00000000b24fdf0 bsp=e00000000b248fe8
 [<a0000002008c4b40>] nfsd3_proc_mkdir+0x1a0/0x220 [nfsd]               
                                sp=e00000000b24fdf0 bsp=e00000000b248f90
 [<a00000020089f820>] nfsd_dispatch+0x340/0x600 [nfsd]                  
                                sp=e00000000b24fdf0 bsp=e00000000b248f38
 [<a0000002004654d0>] svc_process+0x1630/0x1880 [sunrpc]                
                                sp=e00000000b24fdf0 bsp=e00000000b248ec0
 [<a00000020089efb0>] nfsd+0x490/0x9c0 [nfsd]                           
                                sp=e00000000b24fe00 bsp=e00000000b248e38
 [<a000000100018c70>] kernel_thread_helper+0x30/0x60                    
                                sp=e00000000b24fe30 bsp=e00000000b248e10
 [<a000000100008c60>] start_kernel_thread+0x20/0x40                     
                                sp=e00000000b24fe30 bsp=e00000000b248e10
Kernel panic - not syncing: Fatal exception

Comment 2 Nate Straz 2006-06-01 21:05:56 UTC

After further investigation, it appears that the panic is caused by a mkdir in
our tools on an NFS client when trying to start the NFS load.  Relocation had not
been attempted when the panic happened.

Comment 3 Wendy Cheng 2006-06-01 21:55:05 UTC

This is what happens...

RHEL4 defaults all exports to the "sync" option, so nfsd_create() goes to:

        if (EX_ISSYNC(fhp->fh_export)) {
                nfsd_sync_dir(dentry);
                write_inode_now(dchild->d_inode, 1);
        }

And nfsd_sync_dir() calls nfsd_dosync() with filp set to NULL:
 
void nfsd_sync_dir(struct dentry *dp)
{
        nfsd_dosync(NULL, dp, dp->d_inode->i_fop);
}

so nfsd_dosync() passes gfs_fsync() a NULL filp:

inline void nfsd_dosync(struct file *filp, struct dentry *dp,
                        struct file_operations *fop)
{
        struct inode *inode = dp->d_inode;
        int (*fsync) (struct file *, struct dentry *, int);

        filemap_fdatawrite(inode->i_mapping);
        if (fop && (fsync = fop->fsync))
                fsync(filp, dp, 0);
        filemap_fdatawait(inode->i_mapping);
}

And I used filp to get the mapping pointer....

ok, fix is on the way.
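
For illustration only (the actual patch is not quoted in this report): a
minimal userspace sketch of an fsync method that tolerates a NULL filp by
taking the mapping from the dentry instead. The struct definitions and
model_fsync() are hypothetical stand-ins, not the real kernel or GFS code;
only the three-argument fsync signature is taken from the nfsd_dosync()
snippet above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel types (assumption). */
struct address_space { int dirty_pages; };
struct inode  { struct address_space *i_mapping; };
struct dentry { struct inode *d_inode; };
struct file   { struct address_space *f_mapping; };

/*
 * Model of an fsync method with the (struct file *, struct dentry *, int)
 * signature. It never dereferences 'file', because a caller such as
 * nfsd_sync_dir() may pass NULL there. The mapping is reached through
 * the dentry's inode instead.
 */
static int model_fsync(struct file *file, struct dentry *dentry, int datasync)
{
    struct address_space *mapping;

    (void)file;      /* may be NULL; deliberately unused */
    (void)datasync;  /* not modeled in this sketch */

    assert(dentry != NULL && dentry->d_inode != NULL);
    mapping = dentry->d_inode->i_mapping;

    mapping->dirty_pages = 0;  /* stand-in for writing the pages back */
    return 0;
}
```

Calling model_fsync(NULL, &de, 0) with a populated dentry returns 0 in this
model, whereas a variant reading file->f_mapping would fault on the NULL
pointer, which is consistent with the small fault addresses in the oopses
above.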

Comment 4 Wendy Cheng 2006-06-02 04:51:51 UTC

Code checked into CVS. Please re-try.

Comment 5 Nate Straz 2006-08-02 21:50:22 UTC

This bug isn't interfering with our testing anymore.

Comment 7 Red Hat Bugzilla 2006-08-10 21:35:51 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0561.html