Description of problem: I had a three-node GFS/rgmanager cluster exporting a GFS/NFS/IP service to an NFS client. I started some simple I/O from that client to the filesystem and then ran a test, "derringer", which randomly recovers machines and relocates services. As one of the dead machines was coming back up, it hit this panic.

Resource info:

<rm>
  <failoverdomains>
    <failoverdomain name="ALL" ordered="0" restricted="0">
      <failoverdomainnode name="taft-02" priority="1"/>
      <failoverdomainnode name="taft-03" priority="1"/>
      <failoverdomainnode name="taft-04" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <resources>
    <ip address="10.15.84.97" monitor_link="1"/>
    <clusterfs device="/dev/taft/taft0" force_unmount="0" fstype="gfs" mountpoint="/mnt/taft0" name="taft0" options=""/>
    <clusterfs device="/dev/taft/taft1" force_unmount="0" fstype="gfs" mountpoint="/mnt/taft1" name="taft1" options=""/>
    <nfsexport name="nfs exports"/>
    <nfsclient name="joynter" options="rw" target="joynter.lab.msp.redhat.com"/>
    <script file="/usr/tests/sts/rgmanager/bin/logman" name="logman"/>
  </resources>
  <service autostart="1" domain="ALL" name="GFS0">
    <clusterfs ref="taft0">
      <nfsexport ref="nfs exports">
        <nfsclient ref="joynter"/>
      </nfsexport>
    </clusterfs>
    <ip ref="10.15.84.97"/>
  </service>
  <service autostart="1" domain="ALL" name="GFS1">
    <clusterfs ref="taft1">
      <nfsexport ref="nfs exports">
        <nfsclient ref="joynter"/>
      </nfsexport>
    </clusterfs>
    <ip ref="10.15.84.97"/>
  </service>
  <service autostart="1" domain="ALL" name="logman">
    <script ref="logman"/>
  </service>
</rm>

[root@taft-03 ~]# clustat
Member Status: Quorate

  Member Name    Status
  ------ ----    ------
  taft-02        Offline
  taft-03        Online, Local, rgmanager
  taft-04        Online, rgmanager

  Service Name   Owner (Last)   State
  ------- ----   ----- ------   -----
  GFS0           taft-04        started
  GFS1           taft-04        started
  logman         taft-04        started

Unable to handle kernel NULL pointer dereference at 0000000000000020
RIP: <ffffffffa0192918>{:sunrpc:svc_register+28}
PML4 213a6b067 PGD 0
Oops: 0000 [1] SMP
CPU 3
Modules linked in: nfsd exportfs lockd parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core button battery ac uhci_hcd ehci_hcd hw_random e1000 floppy qla2300 qla2xxx sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod lpfc scsi_transport_fc megaraid_mbox megaraid_mm sd_mod scsi_mod
Pid: 3822, comm: rpc.nfsd Not tainted 2.6.9-22.0.1.ELsmp
RIP: 0010:[<ffffffffa0192918>] <ffffffffa0192918>{:sunrpc:svc_register+28}
RSP: 0018:00000102135f3d98  EFLAGS: 00010246
RAX: 0000000000000001 RBX: 00000102162d5280 RCX: 0000000000000000
RDX: 0000000000000801 RSI: 0000000000000006 RDI: 0000000000000000
RBP: 0000010218e41b80 R08: ffffffff804d4618 R09: 0000000000000000
R10: 0000010218e41b80 R11: 00000000000000b8 R12: 0000000000000801
R13: 0000000000000000 R14: 0000000000000006 R15: 0000000000000001
FS:  0000002a9589fb00(0000) GS:ffffffff804d3200(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 0000000037e34000 CR4: 00000000000006e0
Process rpc.nfsd (pid: 3822, threadinfo 00000102135f2000, task 00000102147b57f0)
Stack: ffffffff804d4618 ffffffff804d4618 00000102162d5280 0000010218e41b80
       00000102135f3e1c 0000010214131980 0000000000000000 ffffffffa0194b2f
       ffffffff804d4618 0000000000000000
Call Trace:
  <ffffffffa0194b2f>{:sunrpc:svc_setup_socket+155}
  <ffffffffa01956fb>{:sunrpc:svc_makesock+337}
  <ffffffffa02ee1b5>{:nfsd:nfsd_svc+419}
  <ffffffffa02eecfd>{:nfsd:write_threads+131}
  <ffffffff8015b7d3>{get_zeroed_page+107}
  <ffffffff80195249>{simple_transaction_get+152}
  <ffffffffa02ee8c2>{:nfsd:nfsctl_transaction_write+78}
  <ffffffff80177064>{vfs_write+207}
  <ffffffff8017714c>{sys_write+69}
  <ffffffff80110052>{system_call+126}

Code: 48 8b 6f 20 74 32 83 fe 11 0f b7 ca 48 c7 c2 ed d3 19 a0 74
RIP <ffffffffa0192918>{:sunrpc:svc_register+28} RSP <00000102135f3d98>
CR2: 0000000000000020
<0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
Linux taft-03 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:02 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
GFS 2.6.9-45.0 (built Nov 28 2005 11:39:41) installed
CMAN 2.6.9-41.0 (built Nov 28 2005 11:26:37) installed
rgmanager-1.9.43-0

How reproducible: often
Bumping the priority as QA is hitting this bug regularly during rgmanager/NFS testing.
Just hit and filed a similar bz (if not a dup of this one), 190401. Any update from devel on this issue?
*** Bug 190401 has been marked as a duplicate of this bug. ***
It appears the nfsd_serv pointer is becoming null while it's still in use... What exactly does the "derringer" test do?
Basically all that derringer was doing was relocating the HA GFS services from one machine to another (clusvcadm -r $servicename -m $newserviceowner). This was happening while there was I/O going from the NFS clients to those filesystems. Lately, however, we have seen these panics when just starting the HA GFS services: that is, we get a valid cluster up, mount some GFS filesystems, then do a 'service rgmanager start', which starts clurgmgrd (which I believe then takes care of all the NFS and exportfs stuff). I'm currently attempting to reproduce this with just ext filesystems.
GFS is not required for this issue; I was able to recreate this using just ext3 filesystems.

link-01:
May 9 08:29:36 taft-01 clurgmgrd[25695]: <notice> Resource Group Manager Starting
May 9 08:29:36 taft-01 clurgmgrd[25695]: <info> Loading Service Data
May 9 08:29:36 taft-01 rgmanager: clurgmgrd startup succeeded
May 9 08:29:36 taft-01 clurgmgrd[25695]: <info> Initializing Services
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <info> Removing export: *:/mnt/taft0
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <info> Removing export: *:/mnt/taft1
May 9 08:29:36 taft-01 kernel: Installing knfsd (copyright (C) 1996 okir.de).
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> NFS daemon nfsd is not running.
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> NFS daemon nfsd is not running.
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> Verify that the NFS service run level script is enable
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> Verify that the NFS service run level script is enable
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> Restarting NFS daemons
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> Restarting NFS daemons
May 9 08:29:36 taft-01 rpc.statd[2646]: Caught signal 15, un-registering and exiting.
May 9 08:29:36 taft-01 nfslock: rpc.statd shutdown succeeded
May 9 08:29:36 taft-01 nfslock: rpc.statd shutdown succeeded
May 9 08:29:36 taft-01 rpc.statd[25972]: Version 1.0.6 Starting
May 9 08:29:36 taft-01 rpc.statd[25973]: Version 1.0.6 Starting
May 9 08:29:36 taft-01 rpc.statd[25972]: unable to register (statd, 1, udp).
May 9 08:29:36 taft-01 nfslock: rpc.statd startup succeeded
May 9 08:29:36 taft-01 nfslock: rpc.statd startup succeeded

Unable to handle kernel NULL pointer dereference at 0000000000000038
RIP: <ffffffffa02d11d8>{:nfsd:nfsd_svc+454}
[...]
Steve, I bet the sock->sk socket has been shut down (inet_shutdown) at the point where the socket got released (release_sock). The address takeover must happen sometime between sock->ops->listen() and svc_setup_socket() within svc_create_socket(). How to fix this? No idea at the moment.
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review it is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.