Description of problem: I had a three-node GFS/rgmanager cluster exporting a GFS/NFS/IP service to an NFS client. I started some simple I/O from that client to the filesystem and then ran a test, "derringer", which randomly recovers machines and relocates services. As one of the dead machines was coming back up, it hit this panic.

Resource info:

<rm>
  <failoverdomains>
    <failoverdomain name="ALL" ordered="0" restricted="0">
      <failoverdomainnode name="taft-02" priority="1"/>
      <failoverdomainnode name="taft-03" priority="1"/>
      <failoverdomainnode name="taft-04" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <resources>
    <ip address="10.15.84.97" monitor_link="1"/>
    <clusterfs device="/dev/taft/taft0" force_unmount="0" fstype="gfs" mountpoint="/mnt/taft0" name="taft0" options=""/>
    <clusterfs device="/dev/taft/taft1" force_unmount="0" fstype="gfs" mountpoint="/mnt/taft1" name="taft1" options=""/>
    <nfsexport name="nfs exports"/>
    <nfsclient name="joynter" options="rw" target="joynter.lab.msp.redhat.com"/>
    <script file="/usr/tests/sts/rgmanager/bin/logman" name="logman"/>
  </resources>
  <service autostart="1" domain="ALL" name="GFS0">
    <clusterfs ref="taft0">
      <nfsexport ref="nfs exports">
        <nfsclient ref="joynter"/>
      </nfsexport>
    </clusterfs>
    <ip ref="10.15.84.97"/>
  </service>
  <service autostart="1" domain="ALL" name="GFS1">
    <clusterfs ref="taft1">
      <nfsexport ref="nfs exports">
        <nfsclient ref="joynter"/>
      </nfsexport>
    </clusterfs>
    <ip ref="10.15.84.97"/>
  </service>
  <service autostart="1" domain="ALL" name="logman">
    <script ref="logman"/>
  </service>
</rm>

[root@taft-03 ~]# clustat
Member Status: Quorate

  Member Name    Status
  ------ ----    ------
  taft-02        Offline
  taft-03        Online, Local, rgmanager
  taft-04        Online, rgmanager

  Service Name   Owner (Last)   State
  ------- ----   ----- ------   -----
  GFS0           taft-04        started
  GFS1           taft-04        started
  logman         taft-04        started

Unable to handle kernel NULL pointer dereference at 0000000000000020
RIP: <ffffffffa0192918>{:sunrpc:svc_register+28}
PML4 213a6b067 PGD 0
Oops: 0000 [1] SMP
CPU 3
Modules linked in: nfsd exportfs lockd parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core button battery ac uhci_hcd ehci_hcd hw_random e1000 floppy qla2300 qla2xxx sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod lpfc scsi_transport_fc megaraid_mbox megaraid_mm sd_mod scsi_mod
Pid: 3822, comm: rpc.nfsd Not tainted 2.6.9-22.0.1.ELsmp
RIP: 0010:[<ffffffffa0192918>] <ffffffffa0192918>{:sunrpc:svc_register+28}
RSP: 0018:00000102135f3d98  EFLAGS: 00010246
RAX: 0000000000000001 RBX: 00000102162d5280 RCX: 0000000000000000
RDX: 0000000000000801 RSI: 0000000000000006 RDI: 0000000000000000
RBP: 0000010218e41b80 R08: ffffffff804d4618 R09: 0000000000000000
R10: 0000010218e41b80 R11: 00000000000000b8 R12: 0000000000000801
R13: 0000000000000000 R14: 0000000000000006 R15: 0000000000000001
FS:  0000002a9589fb00(0000) GS:ffffffff804d3200(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 0000000037e34000 CR4: 00000000000006e0
Process rpc.nfsd (pid: 3822, threadinfo 00000102135f2000, task 00000102147b57f0)
Stack: ffffffff804d4618 ffffffff804d4618 00000102162d5280 0000010218e41b80
       00000102135f3e1c 0000010214131980 0000000000000000 ffffffffa0194b2f
       ffffffff804d4618 0000000000000000
Call Trace:
  <ffffffffa0194b2f>{:sunrpc:svc_setup_socket+155}
  <ffffffffa01956fb>{:sunrpc:svc_makesock+337}
  <ffffffffa02ee1b5>{:nfsd:nfsd_svc+419}
  <ffffffffa02eecfd>{:nfsd:write_threads+131}
  <ffffffff8015b7d3>{get_zeroed_page+107}
  <ffffffff80195249>{simple_transaction_get+152}
  <ffffffffa02ee8c2>{:nfsd:nfsctl_transaction_write+78}
  <ffffffff80177064>{vfs_write+207}
  <ffffffff8017714c>{sys_write+69}
  <ffffffff80110052>{system_call+126}

Code: 48 8b 6f 20 74 32 83 fe 11 0f b7 ca 48 c7 c2 ed d3 19 a0 74
RIP <ffffffffa0192918>{:sunrpc:svc_register+28} RSP <00000102135f3d98>
CR2: 0000000000000020
<0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
Linux taft-03 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:02 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
GFS 2.6.9-45.0 (built Nov 28 2005 11:39:41) installed
CMAN 2.6.9-41.0 (built Nov 28 2005 11:26:37) installed
rgmanager-1.9.43-0

How reproducible: often
Bumping the priority as QA is hitting this bug regularly during rgmanager/NFS testing.
Just hit and filed a similar bz (if not a dup of this one), 190401. Any update from devel on this issue?
*** Bug 190401 has been marked as a duplicate of this bug. ***
It appears the nfsd_serv pointer is becoming null while it's still in use... What exactly does the "derringer" test do?
Basically all that derringer was doing was relocating the HA GFS services from one machine to another (clusvcadm -r $servicename -m $newserviceowner). This was happening while there was I/O going from the NFS clients to those filesystems. Lately, however, we have seen these panics when just starting the HA GFS services: that is, we get a valid cluster up, mount some GFS filesystems, then do a 'service rgmanager start', which starts clurgmgrd (which I believe then takes care of all the NFS and exportfs stuff). I'm currently attempting to reproduce this with just ext filesystems.
GFS is not required for this issue; I was able to recreate this using just ext3 filesystems.

link-01:
May 9 08:29:36 taft-01 clurgmgrd[25695]: <notice> Resource Group Manager Starting
May 9 08:29:36 taft-01 clurgmgrd[25695]: <info> Loading Service Data
May 9 08:29:36 taft-01 rgmanager: clurgmgrd startup succeeded
May 9 08:29:36 taft-01 clurgmgrd[25695]: <info> Initializing Services
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <info> Removing export: *:/mnt/taft0
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <info> Removing export: *:/mnt/taft1
May 9 08:29:36 taft-01 kernel: Installing knfsd (copyright (C) 1996 okir.de).
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> NFS daemon nfsd is not running.
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> NFS daemon nfsd is not running.
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> Verify that the NFS service run level script is enable
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> Verify that the NFS service run level script is enable
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> Restarting NFS daemons
May 9 08:29:36 taft-01 clurgmgrd: [25695]: <err> Restarting NFS daemons
May 9 08:29:36 taft-01 rpc.statd[2646]: Caught signal 15, un-registering and exiting.
May 9 08:29:36 taft-01 nfslock: rpc.statd shutdown succeeded
May 9 08:29:36 taft-01 nfslock: rpc.statd shutdown succeeded
May 9 08:29:36 taft-01 rpc.statd[25972]: Version 1.0.6 Starting
May 9 08:29:36 taft-01 rpc.statd[25973]: Version 1.0.6 Starting
May 9 08:29:36 taft-01 rpc.statd[25972]: unable to register (statd, 1, udp).
May 9 08:29:36 taft-01 nfslock: rpc.statd startup succeeded
May 9 08:29:36 taft-01 nfslock: rpc.statd startup succeeded

Unable to handle kernel NULL pointer dereference at 0000000000000038
RIP: <ffffffffa02d11d8>{:nfsd:nfsd_svc+454}
[...]
Steve, I bet the sock->sk socket has been shut down (inet_shutdown) at the point where the socket got released (release_sock). The address takeover must happen sometime between sock->ops->listen() and svc_setup_socket() within svc_create_socket(). How to fix this? No idea at the moment.
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review it is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.