Bug 2063818
Summary: | NFS lockd becomes out-of-sync between client & server when client process is killed with signal | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | Yongcheng Yang <yoyang>
Component: | kernel | Assignee: | Jeff Layton <jlayton>
kernel sub component: | NFS | QA Contact: | Zhi Li <yieli>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | jlayton, nfs-team, xzhou, yoyang
Version: | 9.0 | Keywords: | Patch, Reproducer, Triaged
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | kernel-5.14.0-329.el9 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-11-07 08:38:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Yongcheng Yang
2022-03-14 13:08:44 UTC
I've been unable to reproduce this so far with recent centos9 kernels on the server. How long do I need to leave the test program running? Can you try this again with a recent kernel on the server and see if it's still reproducible for you?

Turns out that the server's hang is a separate problem, which has already been resolved in Bug 2082816 (kernel 5.14.0-133.el9). But the NFSv3 client out-of-sync lockd issue still exists, e.g. with 5.14.0-139.el9: https://beaker.engineering.redhat.com/jobs/6880428

It should not take much time. In my test, running the simple lock test for about 100 seconds and then quitting is enough. I have also just checked this with an upstream kernel (5.19.0+) and can reproduce it there as well, e.g. https://beaker.engineering.redhat.com/recipes/12391604#task148308520

I was finally able to reproduce this, using a command like this:

    tmux new -d 'while true; do timeout 3 ~/test_lockun-1 /mnt/knfsd/stats; sleep 1; done'

...and then killing tmux with a SIGKILL like you were. Once there is a stuck lock, I see this pr_warn pop when I shut down the NFS server:

    lockd: couldn't shutdown host module for net %x!

That suggests that this is a server-side bug, since if it were just client-side, we'd be able to shut down lockd.
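The source of the lock test binary driven by that loop isn't attached to this report; as a rough idea of the pattern it exercises, a minimal stand-in could look something like the sketch below (hypothetical: the real test_lockun-1 may take different arguments and structure its locking differently, but the traces later in this bug show several tasks contending for a whole-file POSIX lock, which is what this mimics).

    /* Hypothetical stand-in for the lock test binary in the tmux loop above
     * (the real test_lockun-1 source is not part of this report).  Forks a
     * handful of children that all contend for a whole-file POSIX write lock
     * on the given path, locking and unlocking in a loop until killed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NPROC 8

    static int lock_loop(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        for (;;) {
            struct flock fl = {
                .l_type   = F_WRLCK,    /* exclusive lock */
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,          /* 0 == whole file */
            };

            /* Blocking lock; on an NFSv3 mount this goes out as an NLM LOCK
             * call (with a GRANTED callback when the lock is contended). */
            if (fcntl(fd, F_SETLKW, &fl) < 0) {
                perror("F_SETLKW");
                return 1;
            }

            usleep(5000);               /* hold the lock briefly */

            fl.l_type = F_UNLCK;        /* release -> NLM UNLOCK on the wire */
            if (fcntl(fd, F_SETLK, &fl) < 0) {
                perror("F_UNLCK");
                return 1;
            }
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        for (int i = 0; i < NPROC; i++) {
            if (fork() == 0)
                return lock_loop(argv[1]);
        }

        /* Parent just waits; in the reproducer the whole group gets killed
         * off externally (timeout / tmux SIGKILL), often mid-cycle. */
        while (wait(NULL) > 0)
            ;
        return 0;
    }

Whole-file POSIX locks over NFSv3 are exactly the path that goes through lockd/NLM on both client and server, which is why killing the group mid-cycle can leave the two sides' lock state out of sync.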
Reproduced it again and turned up nlm debugging before shutting down lockd:

    [ 912.761564] lockd: nuking all hosts in net f0000000...
    [ 912.763256] lockd: host garbage collection for net f0000000
    [ 912.764270] lockd: nlmsvc_mark_resources for net f0000000
    [ 912.765367] nlm_gc_hosts skipping client2 (cnt 2 use 1 exp 4295579981 net f0000000)
    [ 912.766581] lockd: couldn't shutdown host module for net f0000000!
    [ 912.767585] lockd: 1 hosts left in net f0000000:
    [ 912.768233] client2 (cnt 2 use 1 exp 4295579981 net f0000000)
    [ 912.782619] lockd: release server host client2
    [ 912.783831] lockd: shutting down host module
    [ 912.784991] lockd: nuking all hosts in net 0...
    [ 912.786215] lockd: host garbage collection for net 0
    [ 912.787549] lockd: nlmsvc_mark_resources for net 0
    [ 912.788832] lockd: destroy host client2
    [ 912.789880] lockd: destroyed nsm_handle for client2 (192.168.1.210)
    [ 912.791599] lockd_down: service stopped
    [ 912.793538] lockd_down: service destroyed
    [ 912.802879] nfsd: last server has exited, flushing export cache

So the h_inuse value is high at this point. Working on a patch to determine what's happening there.

The problem is that there is still an outstanding lock. I don't see how we'd leave one lingering in the server-side code, so I'm actually suspecting an NLM client bug at this point -- one that causes the final UNLOCK call to not happen (possibly because of a fatal pending signal). I'm now considering how to go about determining this. I may need to add some tracepoints.

I think this is most likely both a client and a server bug. The server doesn't tear down a host's resources properly before shutting down, which is what causes the "couldn't shutdown host module" message. That's fairly simple to fix (and I have a patch I'm proposing upstream to fix it). I suspect that the other part of this bug is client-side. Sometimes the client seems to fail to send an UNLOCK message at shutdown time when a process is killed. I haven't quite nailed that bit down yet. For now, let's keep this bug focused on fixing the known server-side issue, and we can clone this later to pursue the client-side issue.

Ok, dusting off some lockd tracepoints I had done a while back produced this while running this test. Start at 111s:

    <...>-1043 [001] ..... 111.748219: nlmclnt_lock: addr=192.168.1.137:0 oh=0x9ebbced4 svid=0x00000000 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED

...then, we get this series of events. First, the lock gets granted and released a couple of times:

    <...>-1058 [002] ..... 114.659765: nlmclnt_lock: addr=192.168.1.137:0 oh=0xea9e714a svid=0x000001b8 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED
    <...>-1058 [002] ..... 114.666783: nlmclnt_unlock: addr=192.168.1.137:0 oh=0xea9e714a svid=0x000001b8 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED
    <...>-960 [000] ..... 114.666839: nlmclnt_grant: addr=192.168.1.137:681 oh=0xfde56509 svid=0x000001b9 fh=0x67be0d7b start=0 len=1 status=LCK_GRANTED
    <...>-1062 [006] ..... 114.666884: nlmclnt_lock: addr=192.168.1.137:0 oh=0xfde56509 svid=0x000001b9 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED
    <...>-1062 [006] ..... 114.674724: nlmclnt_unlock: addr=192.168.1.137:0 oh=0xfde56509 svid=0x000001b9 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED

...then svid=0x1ba gets the lock:

    <...>-960 [000] ..... 114.674877: nlmclnt_grant: addr=192.168.1.137:681 oh=0xc46859cc svid=0x000001ba fh=0x67be0d7b start=0 len=1 status=LCK_GRANTED
    <...>-1059 [003] ..... 114.674931: nlmclnt_lock: addr=192.168.1.137:0 oh=0xc46859cc svid=0x000001ba fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED

...then suddenly it gets granted to 0x1bb, even though 0x1ba still holds it:

    <...>-960 [000] ..... 114.683161: nlmclnt_grant: addr=192.168.1.137:681 oh=0xd3134d8f svid=0x000001bb fh=0x67be0d7b start=0 len=1 status=LCK_GRANTED

...ahh, but here is an unlock that was being processed at the same time:

    <...>-1059 [003] ..... 114.683196: nlmclnt_unlock: addr=192.168.1.137:0 oh=0xc46859cc svid=0x000001ba fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED

...now everything is stuck and it sits there until it's killed:

    <...>-1065 [006] ..... 114.683219: nlmclnt_lock: addr=192.168.1.137:0 oh=0x8eff1c83 svid=0x000001bf fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1063 [001] ..... 114.683272: nlmclnt_lock: addr=192.168.1.137:0 oh=0xa0093405 svid=0x000001bd fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1053 [002] ..... 114.683858: nlmclnt_lock: addr=192.168.1.137:0 oh=0x764663fe svid=0x000001c8 fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1057 [004] ..... 114.683870: nlmclnt_lock: addr=192.168.1.137:0 oh=0xa9ffac2f svid=0x000001cb fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1045 [004] ..... 114.683982: nlmclnt_lock: addr=192.168.1.137:0 oh=0x463dc711 svid=0x000001c1 fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1048 [001] ..... 114.683999: nlmclnt_lock: addr=192.168.1.137:0 oh=0x2baa32f2 svid=0x000001c4 fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1060 [005] ..... 114.684028: nlmclnt_lock: addr=192.168.1.137:0 oh=0xb7722046 svid=0x000001bc fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED

...eventually we even end up trying to request it for the same svid as the current holder:

    <...>-1043 [003] ..... 114.686412: nlmclnt_lock: addr=192.168.1.137:0 oh=0xd3134d8f svid=0x000001bb fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED

My guess is that we're looking at a race on the client side between a grant callback and an unlock. The operations were processed on the server in order, but the replies got handled in reverse order, such that the newly granted lock got wiped out while processing the unlock reply.

Ran a capture between the client and server while running this test to get a better picture of what's happening on the wire.
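For reference, a capture like that can be taken with stock tools; the interface name and output file below are placeholders rather than details from the report:

    # run on the client or the server; 192.168.1.137 is the server in these traces
    tcpdump -i eth0 -s 0 -w nlm-race.pcap host 192.168.1.137

    # afterwards, show only the NLM traffic using Wireshark's dissector
    tshark -r nlm-race.pcap -Y nlm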
The stuck lock was (again) granted via an NLM_GRANTED_MSG callback:

    96840 2023-03-02 07:51:48.014619 192.168.1.137 192.168.1.136 NLM 234 V4 GRANTED_MSG Call (Reply In 96843) FH:0x67be0d7b svid:5698 pos:0-1

...at about the same time, there was a flurry of CANCEL calls to the server, including one from the same svid as in the GRANTED_MSG call:

    96855 2023-03-02 07:51:48.015595 192.168.1.136 192.168.1.137 NLM 786 V4 CANCEL Call (Reply In 96869) FH:0x67be0d7b svid:5698 pos:0-1 ; V4 CANCEL Call (Reply In 96871) FH:0x67be0d7b svid:5713 pos:0-1 ; V4 CANCEL Call (Reply In 96872) FH:0x67be0d7b svid:5716 pos:0-1 ; V4 CANCEL Call (Reply In 96874) FH:0x67be0d7b svid:5714 pos:0-1

I think that gives further credence to the idea that we have a race between a GRANTED_MSG callback from the server and the wait for the lock being cancelled. This bit has always been difficult to get right, unfortunately.

A patchset that should address this has been posted upstream. To test this properly, you'll need a patched client and server: https://lore.kernel.org/linux-nfs/20230303121603.132103-1-jlayton@kernel.org/

Moving to VERIFIED according to comment#21.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6583
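For anyone re-verifying this on a kernel carrying the fix, one quick sanity check (not part of the original report) is to confirm on the server that no POSIX lock lingers for the test file after the client-side processes are killed, and that lockd then shuts down without the warning quoted above. Something along these lines, with paths adjusted to the local setup:

    # on the NFS server, after killing the client-side test loop
    grep POSIX /proc/locks        # expect no leftover entry for the test file

    systemctl stop nfs-server
    dmesg | grep "couldn't shutdown host module"   # expect no output on a fixed kernel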