Bug 369561 - RHEL5.1 kernel not reclaiming NFS locks when server reboots
Summary: RHEL5.1 kernel not reclaiming NFS locks when server reboots
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: rc
: ---
Assignee: Steve Dickson
QA Contact: Martin Jenner
URL:
Whiteboard:
: 444741 (view as bug list)
Depends On:
Blocks: 240976 445360
TreeView+ depends on / blocked
 
Reported: 2007-11-07 13:06 UTC by Jeff Layton
Modified: 2018-10-19 18:14 UTC (History)
12 users (show)

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 15:00:56 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0314 0 normal SHIPPED_LIVE Updated kernel packages for Red Hat Enterprise Linux 5.2 2008-05-20 18:43:34 UTC

Description Jeff Layton 2007-11-07 13:06:32 UTC
While trying to reproduce the problem for bug 240976, I've noticed that the
RHEL5 kernel doesn't seem to attempt to reclaim locks when it gets a SM_NOTIFY.

Looking at a packet capture, I see the SM_NOTIFY request come in, and the reply
go out, but I don't see any NLM traffic at all. Checking /proc/locks on the
server shows that the lock isn't reclaimed well after the grace period.

I've done a binary search and have determined that the problem exists at least
as far back as -19.el5, but haven't checked anything earlier since brew doesn't
seem to have any earlier builds. -8.1.9.el5 seems to work correctly though, so
the problem was likely introduced somewhere in between those two revs.

Comment 1 Jeff Layton 2007-11-07 13:29:13 UTC
nlm_debug from -8.1.9.el5 kernel when SM_NOTIFY comes in:

Nov  7 08:20:36 dhcp231-179 kernel: lockd: request from 7f000001
Nov  7 08:20:36 dhcp231-179 kernel: lockd: SM_NOTIFY     called
Nov  7 08:20:36 dhcp231-179 kernel: lockd: nlm_lookup_host(ac103b81, p=6, v=4)
Nov  7 08:20:36 dhcp231-179 kernel: lockd: get host 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: lockd: get host 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: lockd: release host 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: nlmsvc_retry_blocked(0000000000000000, when=0)
Nov  7 08:20:36 dhcp231-179 kernel: nlmsvc_retry_blocked(0000000000000000, when=0)
Nov  7 08:20:36 dhcp231-179 kernel: lockd: rebind host 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: NLM: reclaiming locks for host
172.16.59.129lockd: call procedure 2 on 172.16.59.129
Nov  7 08:20:36 dhcp231-179 kernel: lockd: nlm_bind_host(ac103b81)
Nov  7 08:20:36 dhcp231-179 kernel: lockd: server returns status 0
Nov  7 08:20:36 dhcp231-179 kernel: NLM: done reclaiming locks for host
172.16.59.129lockd: release host 172.16.59.129

...nlm_debug from same event on -19.el5 kernel shows no messages from the
kernel, but this shows up in the logs:

Nov  7 08:28:59 dhcp231-179 rpc.statd[2030]: recv_rply: [10.11.231.179] service
100021 not registered





Comment 2 Jeff Layton 2007-11-07 13:38:51 UTC
Running statd in debug mode, I see this:

11/07/2007 08:37:44 rpc.statd[1995]: Received SM_NOTIFY from
dhcp59-129.rdu.redhat.com, state: 725
11/07/2007 08:37:44 rpc.statd[1995]: Waiting for reply... (timeo 5)
11/07/2007 08:37:44 rpc.statd[1995]: address mismatch: expected 10.11.231.179,
got 127.0.0.1

11/07/2007 08:37:44 rpc.statd[1995]: recv_rply: [10.11.231.179] service 100021
not registered
11/07/2007 08:37:44 rpc.statd[1995]: Waiting for client connections.



Comment 3 Jeff Layton 2007-11-07 13:48:15 UTC
When run on a 8.1.9.el5 kernel, it looks like this:

11/07/2007 08:44:59 rpc.statd[1956]: Version 1.0.9 Starting
11/07/2007 08:44:59 rpc.statd[1956]: Flags: No-Daemon Log-STDERR 
11/07/2007 08:44:59 rpc.statd[1956]: New state: 237
11/07/2007 08:44:59 rpc.statd[1956]: Waiting for client connections.
11/07/2007 08:45:20 rpc.statd[1956]: Waiting for client connections.
11/07/2007 08:45:20 rpc.statd[1956]: MONITORING 172.16.59.129 for
dhcp231-179.rdu.redhat.com
11/07/2007 08:45:20 rpc.statd[1956]: Waiting for client connections.
11/07/2007 08:45:37 rpc.statd[1956]: Received SM_NOTIFY from
dhcp59-129.rdu.redhat.com, state: 727
11/07/2007 08:45:37 rpc.statd[1956]: Waiting for reply... (timeo 5)
11/07/2007 08:45:37 rpc.statd[1956]: address mismatch: expected 10.11.231.179,
got 127.0.0.1

11/07/2007 08:45:37 rpc.statd[1956]: Waiting for reply... (timeo 5)
11/07/2007 08:45:37 rpc.statd[1956]: address mismatch: expected 10.11.231.179,
got 127.0.0.1

11/07/2007 08:45:37 rpc.statd[1956]: Callback to dhcp231-179.rdu.redhat.com (for
1871380944) succeeded.
11/07/2007 08:45:37 rpc.statd[1956]: Waiting for client connections.


Comment 4 Jeff Layton 2007-11-07 15:18:28 UTC
I think I see the problem...on more recent kernels, lockd is not registering a
UDP port, and statd seems to expect one:

2.6.18-19.el5xen:

# rpcinfo -p | grep nlockmgr
    100021    1   tcp  45684  nlockmgr
    100021    3   tcp  45684  nlockmgr
    100021    4   tcp  45684  nlockmgr

2.6.18-8.1.9.el5xen:

# rpcinfo -p | grep nlock
    100021    1   udp  32770  nlockmgr
    100021    3   udp  32770  nlockmgr
    100021    4   udp  32770  nlockmgr
    100021    1   tcp  47795  nlockmgr
    100021    3   tcp  47795  nlockmgr
    100021    4   tcp  47795  nlockmgr


Comment 6 Jeff Layton 2007-11-07 18:12:11 UTC
Looks like upstream is likely broken as well...

After talking with Steve, I think the right approach will be to fix statd and
allow it to talk to lockd via TCP if UDP fails. That's going to take a bit of
rework for statd though. It's got UDP concepts hardcoded in a lot of places.

At the moment, I'm trying to do a temporary patch to statd to make it talk TCP
to the kernel as a test to see whether that fixes this. A real fix will probably
take a bit...



Comment 8 RHEL Program Management 2007-11-07 18:25:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Jeff Layton 2007-11-20 11:20:23 UTC
There is a workaround for this:

If you set lockd to have a static UDP port then it will open a UDP socket even
for TCP mounts.

Comment 11 Jeff Layton 2007-11-30 13:36:11 UTC
...though this does not work if the patch for bug 313671 is in place (since it
changes nfslock to use sysctls instead of module parameters). We may want to
hold off on applying that patch to nfs-utils until this problem is resolved to
allow for the easy workaround.


Comment 14 Don Zickus 2008-01-24 21:15:44 UTC
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 16 Andrius Benokraitis 2008-02-15 14:41:19 UTC
Bikash, if you'd like to test this, pick up any kernel from the link in Comment
#14 newer than Package: kernel-2.6.18-63.el5

Comment 18 Jiri Skrabal 2008-05-06 14:00:30 UTC
*** Bug 444741 has been marked as a duplicate of this bug. ***

Comment 19 Jiri Skrabal 2008-05-06 14:04:03 UTC
Proposing for 5.1.z EUS (see the bug #444741 for explanation)

Comment 22 errata-xmlrpc 2008-05-21 15:00:56 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html



Note You need to log in before you can comment on or make changes to this bug.