Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 369561

Summary:	RHEL5.1 kernel not reclaiming NFS locks when server reboots
Product:	Red Hat Enterprise Linux 5	Reporter:	Jeff Layton <jlayton>
Component:	kernel	Assignee:	Steve Dickson <steved>
Status:	CLOSED ERRATA	QA Contact:	Martin Jenner <mjenner>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	5.1	CC:	amyagi, andriusb, bikash.choudhury, coughlan, dmair, juanino, lhh, lsmid, rkenna, staubach, steved, tao
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHBA-2008-0314	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-05-21 15:00:56 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	240976, 445360

Description Jeff Layton 2007-11-07 13:06:32 UTC

While trying to reproduce the problem for bug 240976, I've noticed that the
RHEL5 kernel doesn't seem to attempt to reclaim locks when it gets a SM_NOTIFY.

Looking at a packet capture, I see the SM_NOTIFY request come in, and the reply
go out, but I don't see any NLM traffic at all. Checking /proc/locks on the
server shows that the lock isn't reclaimed well after the grace period.

I've done a binary search and have determined that the problem exists at least
as far back as -19.el5, but haven't checked anything earlier since brew doesn't
seem to have any earlier builds. -8.1.9.el5 seems to work correctly though, so
the problem was likely introduced somewhere in between those two revs.

Comment 1 Jeff Layton 2007-11-07 13:29:13 UTC

nlm_debug from -8.1.9.el5 kernel when SM_NOTIFY comes in:

Nov  7 08:20:36 dhcp231-179 kernel: lockd: request from 7f000001
Nov  7 08:20:36 dhcp231-179 kernel: lockd: SM_NOTIFY     called
Nov  7 08:20:36 dhcp231-179 kernel: lockd: nlm_lookup_host(ac103b81, p=6, v=4)
Nov  7 08:20:36 dhcp231-179 kernel: lockd: get host 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: lockd: get host 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: lockd: release host 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: nlmsvc_retry_blocked(0000000000000000, when=0)
Nov  7 08:20:36 dhcp231-179 kernel: nlmsvc_retry_blocked(0000000000000000, when=0)
Nov  7 08:20:36 dhcp231-179 kernel: lockd: rebind host 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: NLM: reclaiming locks for host
172.16.59.129lockd: call procedure 2 on 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: lockd: nlm_bind_host(ac103b81)
Nov  7 08:20:36 dhcp231-179 kernel: lockd: server returns status 0
Nov  7 08:20:36 dhcp231-179 kernel: NLM: done reclaiming locks for host
172.16.59.129lockd: release host 172.16.59.129

...nlm_debug from same event on -19.el5 kernel shows no messages from the
kernel, but this shows up in the logs:

Nov  7 08:28:59 dhcp231-179 rpc.statd[2030]: recv_rply: [10.11.231.179] service
100021 not registered

Comment 2 Jeff Layton 2007-11-07 13:38:51 UTC

Running statd in debug mode, I see this:

11/07/2007 08:37:44 rpc.statd[1995]: Received SM_NOTIFY from
dhcp59-129.rdu.redhat.com, state: 725
11/07/2007 08:37:44 rpc.statd[1995]: Waiting for reply... (timeo 5)
11/07/2007 08:37:44 rpc.statd[1995]: address mismatch: expected 10.11.231.179,
got 127.0.0.1

11/07/2007 08:37:44 rpc.statd[1995]: recv_rply: [10.11.231.179] service 100021
not registered
11/07/2007 08:37:44 rpc.statd[1995]: Waiting for client connections.

Comment 3 Jeff Layton 2007-11-07 13:48:15 UTC

When run on a 8.1.9.el5 kernel, it looks like this:

11/07/2007 08:44:59 rpc.statd[1956]: Version 1.0.9 Starting
11/07/2007 08:44:59 rpc.statd[1956]: Flags: No-Daemon Log-STDERR 
11/07/2007 08:44:59 rpc.statd[1956]: New state: 237
11/07/2007 08:44:59 rpc.statd[1956]: Waiting for client connections.
11/07/2007 08:45:20 rpc.statd[1956]: Waiting for client connections.
11/07/2007 08:45:20 rpc.statd[1956]: MONITORING 172.16.59.129 for
dhcp231-179.rdu.redhat.com
11/07/2007 08:45:20 rpc.statd[1956]: Waiting for client connections.
11/07/2007 08:45:37 rpc.statd[1956]: Received SM_NOTIFY from
dhcp59-129.rdu.redhat.com, state: 727
11/07/2007 08:45:37 rpc.statd[1956]: Waiting for reply... (timeo 5)
11/07/2007 08:45:37 rpc.statd[1956]: address mismatch: expected 10.11.231.179,
got 127.0.0.1

11/07/2007 08:45:37 rpc.statd[1956]: Waiting for reply... (timeo 5)
11/07/2007 08:45:37 rpc.statd[1956]: address mismatch: expected 10.11.231.179,
got 127.0.0.1

11/07/2007 08:45:37 rpc.statd[1956]: Callback to dhcp231-179.rdu.redhat.com (for
1871380944) succeeded.
11/07/2007 08:45:37 rpc.statd[1956]: Waiting for client connections.

Comment 4 Jeff Layton 2007-11-07 15:18:28 UTC

I think I see the problem...on more recent kernels, lockd is not registering a
UDP port, and statd seems to expect one:

2.6.18-19.el5xen:

# rpcinfo -p | grep nlockmgr
    100021    1   tcp  45684  nlockmgr
    100021    3   tcp  45684  nlockmgr
    100021    4   tcp  45684  nlockmgr

2.6.18-8.1.9.el5xen:

# rpcinfo -p | grep nlock
    100021    1   udp  32770  nlockmgr
    100021    3   udp  32770  nlockmgr
    100021    4   udp  32770  nlockmgr
    100021    1   tcp  47795  nlockmgr
    100021    3   tcp  47795  nlockmgr
    100021    4   tcp  47795  nlockmgr

Comment 6 Jeff Layton 2007-11-07 18:12:11 UTC

Looks like upstream is likely broken as well...

After talking with Steve, I think the right approach will be to fix statd and
allow it to talk to lockd via TCP if UDP fails. That's going to take a bit of
rework for statd though. It's got UDP concepts hardcoded in a lot of places.

At the moment, I'm trying to do a temporary patch to statd to make it talk TCP
to the kernel as a test to see whether that fixes this. A real fix will probably
take a bit...

Comment 8 RHEL Program Management 2007-11-07 18:25:21 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Jeff Layton 2007-11-20 11:20:23 UTC

There is a workaround for this:

If you set lockd to have a static UDP port then it will open a UDP socket even
for TCP mounts.

Comment 11 Jeff Layton 2007-11-30 13:36:11 UTC

...though this does not work if the patch for bug 313671 is in place (since it
changes nfslock to use sysctls instead of module parameters). We may want to
hold off on applying that patch to nfs-utils until this problem is resolved to
allow for the easy workaround.

Comment 14 Don Zickus 2008-01-24 21:15:44 UTC

You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 16 Andrius Benokraitis 2008-02-15 14:41:19 UTC

Bikash, if you'd like to test this, pick up any kernel from the link in Comment
#14 newer than Package: kernel-2.6.18-63.el5

Comment 18 Jiri Skrabal 2008-05-06 14:00:30 UTC

*** Bug 444741 has been marked as a duplicate of this bug. ***

Comment 19 Jiri Skrabal 2008-05-06 14:04:03 UTC

Proposing for 5.1.z EUS (see the bug #444741 for explanation)

Comment 22 errata-xmlrpc 2008-05-21 15:00:56 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html