Bug 2063818
Summary: | NFS lockd becomes out-of-sync between client & server when client process is killed with signal | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | Yongcheng Yang <yoyang>
Component: | kernel | Assignee: | Jeff Layton <jlayton>
kernel sub component: | NFS | QA Contact: | Zhi Li <yieli>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | jlayton, nfs-team, xzhou, yoyang
Version: | 9.0 | Keywords: | Patch, Reproducer, Triaged
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | kernel-5.14.0-329.el9 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-11-07 08:38:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Yongcheng Yang
2022-03-14 13:08:44 UTC
I've been unable to reproduce this so far with recent centos9 kernels on the server. How long do I need to leave the test program running? Can you try this again with a recent kernel on the server and see if it's still reproducible for you?

Turns out that the server's hang is a separate problem, which has already been resolved in Bug 2082816 (kernel 5.14.0-133.el9). But the NFSv3 client out-of-sync lockd issue still exists, e.g. with 5.14.0-139.el9: https://beaker.engineering.redhat.com/jobs/6880428

It should not take much time. In my test, running the simple lock test for about 100 seconds and then quitting is enough. I have also just checked this with an upstream kernel (5.19.0+) and can reproduce it there as well, e.g. https://beaker.engineering.redhat.com/recipes/12391604#task148308520

I was finally able to reproduce this, using a command like this:

    tmux new -d 'while true; do timeout 3 ~/test_lockun-1 /mnt/knfsd/stats; sleep 1; done'

...and then killing tmux with a SIGKILL like you were. Once there is a stuck lock, I see this pr_warn pop when I shut down the NFS server:

    lockd: couldn't shutdown host module for net %x!

That suggests that this is a server-side bug, since if it were just client-side, we'd be able to shut down lockd.
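The source of the lock test binary driven by that loop isn't attached to this report; as a rough idea of the pattern it exercises, a minimal stand-in could look something like the sketch below (hypothetical: the real test_lockun-1 may take different arguments and structure its locking differently, but the traces later in this bug show several tasks contending for a whole-file POSIX lock, which is what this mimics).

    /* Hypothetical stand-in for the lock test binary in the tmux loop above
     * (the real test_lockun-1 source is not part of this report).  Forks a
     * handful of children that all contend for a whole-file POSIX write lock
     * on the given path, locking and unlocking in a loop until killed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NPROC 8

    static int lock_loop(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        for (;;) {
            struct flock fl = {
                .l_type   = F_WRLCK,    /* exclusive lock */
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,          /* 0 == whole file */
            };

            /* Blocking lock; on an NFSv3 mount this goes out as an NLM LOCK
             * call (with a GRANTED callback when the lock is contended). */
            if (fcntl(fd, F_SETLKW, &fl) < 0) {
                perror("F_SETLKW");
                return 1;
            }

            usleep(5000);               /* hold the lock briefly */

            fl.l_type = F_UNLCK;        /* release -> NLM UNLOCK on the wire */
            if (fcntl(fd, F_SETLK, &fl) < 0) {
                perror("F_UNLCK");
                return 1;
            }
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        for (int i = 0; i < NPROC; i++) {
            if (fork() == 0)
                return lock_loop(argv[1]);
        }

        /* Parent just waits; in the reproducer the whole group gets killed
         * off externally (timeout / tmux SIGKILL), often mid-cycle. */
        while (wait(NULL) > 0)
            ;
        return 0;
    }

Whole-file POSIX locks over NFSv3 are exactly the path that goes through lockd/NLM on both client and server, which is why killing the group mid-cycle can leave the two sides' lock state out of sync.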
Reproduced it again and turned up nlm debugging before shutting down lockd:

    [ 912.761564] lockd: nuking all hosts in net f0000000...
    [ 912.763256] lockd: host garbage collection for net f0000000
    [ 912.764270] lockd: nlmsvc_mark_resources for net f0000000
    [ 912.765367] nlm_gc_hosts skipping client2 (cnt 2 use 1 exp 4295579981 net f0000000)
    [ 912.766581] lockd: couldn't shutdown host module for net f0000000!
    [ 912.767585] lockd: 1 hosts left in net f0000000:
    [ 912.768233] client2 (cnt 2 use 1 exp 4295579981 net f0000000)
    [ 912.782619] lockd: release server host client2
    [ 912.783831] lockd: shutting down host module
    [ 912.784991] lockd: nuking all hosts in net 0...
    [ 912.786215] lockd: host garbage collection for net 0
    [ 912.787549] lockd: nlmsvc_mark_resources for net 0
    [ 912.788832] lockd: destroy host client2
    [ 912.789880] lockd: destroyed nsm_handle for client2 (192.168.1.210)
    [ 912.791599] lockd_down: service stopped
    [ 912.793538] lockd_down: service destroyed
    [ 912.802879] nfsd: last server has exited, flushing export cache

So the h_inuse value is high at this point. Working on a patch to determine what's happening there.

The problem is that there is still an outstanding lock. I don't see how we'd leave one lingering in the server-side code, so I'm actually suspecting an NLM client bug at this point -- one that causes the final UNLOCK call to not happen (possibly because of a fatal pending signal). I'm now considering how to go about determining this. I may need to add some tracepoints.

I think this is most likely both a client and a server bug. The server doesn't tear down a host's resources properly before shutting down, which is what causes the "couldn't shutdown host module" message. That's fairly simple to fix (and I have a patch I'm proposing upstream to fix it). I suspect that the other part of this bug is client-side. Sometimes the client seems to fail to send an UNLOCK message at shutdown time when a process is killed. I haven't quite nailed that bit down yet. For now, let's keep this bug focused on fixing the known server-side issue, and we can clone this later to pursue the client-side issue.

Ok, dusting off some lockd tracepoints I had done a while back produced this while running this test. Start at 111s:

    <...>-1043 [001] ..... 111.748219: nlmclnt_lock: addr=192.168.1.137:0 oh=0x9ebbced4 svid=0x00000000 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED

...then, we get this series of events. First, the lock gets granted and released a couple of times:

    <...>-1058 [002] ..... 114.659765: nlmclnt_lock: addr=192.168.1.137:0 oh=0xea9e714a svid=0x000001b8 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED
    <...>-1058 [002] ..... 114.666783: nlmclnt_unlock: addr=192.168.1.137:0 oh=0xea9e714a svid=0x000001b8 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED
    <...>-960 [000] ..... 114.666839: nlmclnt_grant: addr=192.168.1.137:681 oh=0xfde56509 svid=0x000001b9 fh=0x67be0d7b start=0 len=1 status=LCK_GRANTED
    <...>-1062 [006] ..... 114.666884: nlmclnt_lock: addr=192.168.1.137:0 oh=0xfde56509 svid=0x000001b9 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED
    <...>-1062 [006] ..... 114.674724: nlmclnt_unlock: addr=192.168.1.137:0 oh=0xfde56509 svid=0x000001b9 fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED

...then svid=0x1ba gets the lock:

    <...>-960 [000] ..... 114.674877: nlmclnt_grant: addr=192.168.1.137:681 oh=0xc46859cc svid=0x000001ba fh=0x67be0d7b start=0 len=1 status=LCK_GRANTED
    <...>-1059 [003] ..... 114.674931: nlmclnt_lock: addr=192.168.1.137:0 oh=0xc46859cc svid=0x000001ba fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED

...then suddenly it gets granted to 0x1bb, even though 0x1ba still holds it:

    <...>-960 [000] ..... 114.683161: nlmclnt_grant: addr=192.168.1.137:681 oh=0xd3134d8f svid=0x000001bb fh=0x67be0d7b start=0 len=1 status=LCK_GRANTED

...ahh, but here is an unlock that was being processed at the same time:

    <...>-1059 [003] ..... 114.683196: nlmclnt_unlock: addr=192.168.1.137:0 oh=0xc46859cc svid=0x000001ba fh=0x67be0d7b start=0 len=0 status=LCK_GRANTED

...now everything is stuck and it sits there until it's killed:

    <...>-1065 [006] ..... 114.683219: nlmclnt_lock: addr=192.168.1.137:0 oh=0x8eff1c83 svid=0x000001bf fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1063 [001] ..... 114.683272: nlmclnt_lock: addr=192.168.1.137:0 oh=0xa0093405 svid=0x000001bd fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1053 [002] ..... 114.683858: nlmclnt_lock: addr=192.168.1.137:0 oh=0x764663fe svid=0x000001c8 fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1057 [004] ..... 114.683870: nlmclnt_lock: addr=192.168.1.137:0 oh=0xa9ffac2f svid=0x000001cb fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1045 [004] ..... 114.683982: nlmclnt_lock: addr=192.168.1.137:0 oh=0x463dc711 svid=0x000001c1 fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1048 [001] ..... 114.683999: nlmclnt_lock: addr=192.168.1.137:0 oh=0x2baa32f2 svid=0x000001c4 fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED
    <...>-1060 [005] ..... 114.684028: nlmclnt_lock: addr=192.168.1.137:0 oh=0xb7722046 svid=0x000001bc fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED

...eventually we even end up trying to request it for the same svid as the current holder:

    <...>-1043 [003] ..... 114.686412: nlmclnt_lock: addr=192.168.1.137:0 oh=0xd3134d8f svid=0x000001bb fh=0x67be0d7b start=0 len=0 status=LCK_BLOCKED

My guess is that we're looking at a race on the client side between a grant callback and an unlock. The operations were processed on the server in order, but the replies got handled in reverse order, such that the newly granted lock got wiped out while processing the unlock reply.

Ran a capture between the client and server while running this test to get a better picture of what's happening on the wire.
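For reference, a capture like that can be taken with stock tools; the interface name and output file below are placeholders rather than details from the report:

    # run on the client or the server; 192.168.1.137 is the server in these traces
    tcpdump -i eth0 -s 0 -w nlm-race.pcap host 192.168.1.137

    # afterwards, show only the NLM traffic using Wireshark's dissector
    tshark -r nlm-race.pcap -Y nlm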
The stuck lock was (again) granted via an NLM_GRANTED_MSG callback:

    96840 2023-03-02 07:51:48.014619 192.168.1.137 192.168.1.136 NLM 234 V4 GRANTED_MSG Call (Reply In 96843) FH:0x67be0d7b svid:5698 pos:0-1

...at about the same time, there was a flurry of CANCEL calls to the server, including one from the same svid as in the GRANTED_MSG call:

    96855 2023-03-02 07:51:48.015595 192.168.1.136 192.168.1.137 NLM 786 V4 CANCEL Call (Reply In 96869) FH:0x67be0d7b svid:5698 pos:0-1 ; V4 CANCEL Call (Reply In 96871) FH:0x67be0d7b svid:5713 pos:0-1 ; V4 CANCEL Call (Reply In 96872) FH:0x67be0d7b svid:5716 pos:0-1 ; V4 CANCEL Call (Reply In 96874) FH:0x67be0d7b svid:5714 pos:0-1

I think that gives further credence to the idea that we have a race between a GRANTED_MSG callback from the server and the wait for the lock being cancelled. This bit has always been difficult to get right, unfortunately.

A patchset that should address this has been posted upstream. To test this properly, you'll need a patched client and server: https://lore.kernel.org/linux-nfs/20230303121603.132103-1-jlayton@kernel.org/

Moving to VERIFIED according to comment#21.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6583
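For anyone re-verifying this on a kernel carrying the fix, one quick sanity check (not part of the original report) is to confirm on the server that no POSIX lock lingers for the test file after the client-side processes are killed, and that lockd then shuts down without the warning quoted above. Something along these lines, with paths adjusted to the local setup:

    # on the NFS server, after killing the client-side test loop
    grep POSIX /proc/locks        # expect no leftover entry for the test file

    systemctl stop nfs-server
    dmesg | grep "couldn't shutdown host module"   # expect no output on a fixed kernel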