Bug 2176575 - intermittent severe NFS client performance drop via nfs_server_reap_expired_delegations looping?
Summary: intermittent severe NFS client performance drop via nfs_server_reap_expired_delegations looping?
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 39
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-08 18:28 UTC by Frank Ch. Eigler
Modified: 2024-03-04 11:49 UTC
CC List: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-04 11:49:47 UTC
Type: Bug
Embargoed:



Description Frank Ch. Eigler 2023-03-08 18:28:42 UTC
We have a f37 client (kernel 6.1.14-200.fc37.x86_64) mounted over NFSv4 from an
up-to-date Synology NFS server.  Intermittently, a condition develops whereby
the client's speed falls off a cliff.  Normal NFS ops are instantaneous, but
when this condition develops, ordinary ls operations can take on the order of
seconds, and firefox startup around 30.

mount reports active options thusly:
(rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.1,local_lock=none,addr=192.168.1.21)

Other clients of the same server are not affected at the same time.  No
contemporaneous diagnostics are visible on the affected client, the other
clients, or the server, so it's hard to pinpoint when the condition begins.
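
If it recurs, one way to pinpoint the onset after the fact might be to log the
NFSv4 client per-operation counters on an interval and look for when particular
op counts start jumping.  A rough sketch (interval and log path are arbitrary):

  # hypothetical monitoring loop; nfsstat -c4 prints NFSv4 client per-op counters
  while sleep 60; do
      date -Is    >> /var/tmp/nfsstat.log
      nfsstat -c4 >> /var/tmp/nfsstat.log
  done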

A "perf top" on the kernel displays this peculiarity:

Samples: 6M of event 'cycles', 4000 Hz, Event count (approx.): 38276948048 lost: 0/0 drop: 0/0
Overhead  Shared Object                                   Symbol
  49.85%  [kernel]                                        [k] nfs_server_reap_expired_delegations                   ◆
   4.47%  [kernel]                                        [k] nfs_mark_test_expired_all_delegations                 ▒
   1.79%  [kernel]                                        [k] add_interrupt_randomness                              ▒
   0.87%  [kernel]                                        [k] check_preemption_disabled                             ▒
   0.86%  [kernel]                                        [k] read_tsc                                              ▒

i.e., nfs_server_reap_expired_delegations() is very very busy.
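
A call-graph profile might show what keeps re-driving that function.  A sketch,
with an arbitrary 30-second sampling window:

  # sample all CPUs with call graphs, then browse the report interactively
  perf record -a -g -- sleep 30
  perf report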

iptraf-ng reports O(1000) packets per second to the server in a relatively
quiescent state (i.e. no active file activity); most of them appear to be
getattr queries/responses.  That rises to O(3000)+ such packets per second
when NFS activity is attempted.

13:21:56.343792 IP (tos 0x2,ECT(0), ttl 64, id 43049, offset 0, flags [DF], proto TCP (6), length 268)
    CLIENT > SERVER: Flags [P.], cksum 0x8465 (incorrect -> 0x0852), seq 74016:74232, ack 34429, win 24571, options [nop,nop,TS val 1821171799 ecr 1844073712], length 216: NFS request xid 2025763571 212 getattr fh 0,1/53
13:21:56.343997 IP (tos 0x2,ECT(0), ttl 64, id 6274, offset 0, flags [DF], proto TCP (6), length 152)
    SERVER > CLIENT: Flags [P.], cksum 0xe7ab (correct), seq 34429:34529, ack 74232, win 24576, options [nop,nop,TS val 1844073713 ecr 1821171799], length 100: NFS reply xid 2025763571 reply ok 96 getattr NON 2 ids 0/-777130909 sz 469762048
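
For later Wireshark analysis, a raw capture of just the NFS traffic can be taken
with something along these lines (interface and server address as in this setup;
output path arbitrary):

  # full-size packets to/from the NFS server, written out for offline decoding
  tcpdump -i br0 -s 0 -w /var/tmp/nfs-loop.pcap host 192.168.1.21 and port 2049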

umount -l followed by a re-mount of the same filesystem, with the machine
staying up, does not appear to help matters.  Once the condition begins, there
appears to be no stopping it, even with a server reboot.  A client reboot does
fix things right up, unsurprisingly. :-)

Comment 1 Frank Ch. Eigler 2023-03-08 18:47:09 UTC
wireshark indicates NFSv4 TEST_STATEID packets

Frame 8: 282 bytes on wire (2256 bits), 282 bytes captured (2256 bits) on interface br0, id 0
Ethernet II, Src: 86:59:3e:3f:64:1d (86:59:3e:3f:64:1d), Dst: Synology_b2:55:91 (00:11:32:b2:55:91)
Internet Protocol Version 4, Src: 192.168.1.1, Dst: 192.168.1.21
Transmission Control Protocol, Src Port: 678, Dst Port: 2049, Seq: 649, Ack: 401, Len: 216
Remote Procedure Call, Type:Call XID:0x542ce2f3
Network File System, Ops(2): SEQUENCE, TEST_STATEID

No.     Time           Source                Destination           Protocol Info                                                            Length
      9 0.001713316    192.168.1.21          192.168.1.1           NFS      V4 Reply (Call In 8) TEST_STATEID                               166

Frame 9: 166 bytes on wire (1328 bits), 166 bytes captured (1328 bits) on interface br0, id 0
Ethernet II, Src: Synology_b2:55:91 (00:11:32:b2:55:91), Dst: 86:59:3e:3f:64:1d (86:59:3e:3f:64:1d)
Internet Protocol Version 4, Src: 192.168.1.21, Dst: 192.168.1.1
Transmission Control Protocol, Src Port: 2049, Dst Port: 678, Seq: 401, Ack: 865, Len: 100
Remote Procedure Call, Type:Reply XID:0x542ce2f3
Network File System, Ops(2): SEQUENCE, TEST_STATEID

No.     Time           Source                Destination           Protocol Info                                                            Length
     10 0.001966063    192.168.1.1           192.168.1.21          NFS      V4 Call (Reply In 11) TEST_STATEID                              282
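
To quantify the storm from such a capture, a tshark pass along these lines may
help.  This assumes Wireshark exposes NFSv4 operations via the nfs.opcode field,
with 55 being OP_TEST_STATEID in NFSv4.1:

  # count TEST_STATEID calls/replies seen in the capture
  tshark -r /var/tmp/nfs-loop.pcap -Y 'nfs.opcode == 55' | wc -l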

Comment 2 Frank Ch. Eigler 2023-03-17 03:52:31 UTC
The presence of a sysctl.d setting to turn fs.leases-enable=0 does not prevent this problem.
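
For reference, that setting is the standard fs.leases-enable sysctl, applied via
a drop-in roughly like the following (file name arbitrary):

  # /etc/sysctl.d/90-no-leases.conf -- disable file leases; reload with "sysctl --system"
  fs.leases-enable = 0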

Comment 3 Frank Ch. Eigler 2023-07-17 20:40:45 UTC
fedora38 kernel 6.3.8-200 does not fix this issue.  Once it hits (after something like 3 weeks of uptime and no other known changes on the network), nfsstat -c indicates 2-30 thousand test_stateid operations per second, with the [IP-manager] kernel thread occupying the bulk of CPU time.
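
The busy thread can be confirmed with a per-thread CPU listing, something like:

  # busiest threads first; the NFSv4 state manager shows up as "<server>-manager"
  ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head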

Comment 4 Trond Myklebust 2023-07-17 21:04:57 UTC
The above is expected to occur if/when the NFSv4.1 server indicates that stateids have been administratively revoked, or that the lease has expired, by returning one of the errors NFS4ERR_DELEG_REVOKED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_EXPIRED.
It can also occur if the server sets one of the SEQUENCE status flags SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED or SEQ4_STATUS_ADMIN_STATE_REVOKED.

In all those cases, the client is required by the spec to iterate through its list of known stateids, and call TEST_STATEID to ascertain whether each stateid is still valid.
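
If a packet capture is available, whether the server actually signalled a
revocation or expiry should be visible in the decoded replies.  A rough sketch
(the exact field names vary across Wireshark versions, hence the plain-text grep
over the verbose decode):

  # tally any NFS4ERR_* statuses or revocation/expiry indications seen on the wire
  tshark -r /var/tmp/nfs-loop.pcap -V | grep -E 'NFS4ERR_|REVOKED|EXPIRED' | sort | uniq -c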

Comment 5 Frank Ch. Eigler 2023-07-17 21:35:22 UTC
There appears to be no trace of such a message coming from the server: not in its own logs, nor in the client's, nor as any abnormality on any of the other clients.  Further, once this situation begins, the iteration through the list of known stateids appears to be perpetual and continuous until the machine is rebooted.  It does not seem to be a one-pass query.

Comment 6 Trond Myklebust 2023-07-17 22:35:25 UTC
Do you have logs to post?

Comment 7 Frank Ch. Eigler 2023-07-17 23:58:37 UTC
> Do you have logs to post?

I was going to say no, because the client logged nothing.
However!  Yet another peer NFS client (not 192.168.1.1), which continued
performing normally, dmesg'd this about the NFS server:

[379489.954753] NFS: server HOSTNAME error: fileid changed
                fsid 0:60: expected fileid 0x9fbe8fe, got 0x9fbe903
[380399.949813] NFS: server HOSTNAME error: fileid changed
                fsid 0:60: expected fileid 0x9fbede6, got 0x9fbede7
[380403.949741] NFS: server HOSTNAME error: fileid changed
                fsid 0:60: expected fileid 0x9fbedeb, got 0x9fbeded
[380428.949556] NFS: server HOSTNAME error: fileid changed
                fsid 0:60: expected fileid 0x9fbee06, got 0x9fbee0a
[380518.949379] NFS: server HOSTNAME error: fileid changed
                fsid 0:60: expected fileid 0x9fbee52, got 0x9fbee53
[380548.949321] NFS: server HOSTNAME error: fileid changed
                fsid 0:60: expected fileid 0x9fbee69, got 0x9fbee6a
[384159.931317] NFS: server HOSTNAME error: fileid changed
                fsid 0:60: expected fileid 0x9fbfd6d, got 0x9fbfd6e

and the NFS server turns out to have dmesg'd this:

[4259331.428885] nfsd4_validate_stateid: 443 callbacks suppressed
[4259331.434869] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.443101] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.451501] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.459626] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.467707] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.475864] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.483976] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.492077] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.500249] NFSD: client 192.168.1.1 testing state ID with incorrect client ID
[4259331.508330] NFSD: client 192.168.1.1 testing state ID with incorrect client ID

These may have been logged at around the same time.  There were no other log entries, just these ~10, but the problem persisted for several hours, until a reboot.

If the problem should occur again, is there some tracing or debugging data you would suggest gathering, from either the client or the server?

Comment 8 Trond Myklebust 2023-07-18 00:25:30 UTC
Do these clients have unique hostnames and/or do they have NFSv4 uniquifiers set?

Comment 9 Frank Ch. Eigler 2023-07-18 00:34:08 UTC
(In reply to Trond Myklebust from comment #8)
> Do these clients have unique hostnames and/or do they have NFSv4 uniquifiers
> set?

Unique hostnames & IP addresses, yes, and a nearby arpwatch doesn't detect any shenanigans.

By uniquifier, do you mean a manually set /sys/fs/nfs/net/nfs_client/identifier?  No, just
the default "(null)" appears.
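
i.e. checked with:

  # per-netns NFSv4 client uniquifier; empty/"(null)" unless set explicitly
  cat /sys/fs/nfs/net/nfs_client/identifier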

Comment 10 Trond Myklebust 2023-07-18 00:49:55 UTC
Well..

'NFS: server <wwww> error: fileid changed fsid <xxxx>: expected fileid <yyyy>, got <zzzz>' implies that the server is having trouble resolving filehandles correctly. It is returning files with inode numbers that do not match those which were originally returned when the file was looked up by name. This will obviously confuse clients, since they expect filehandles to always be unique.

The message 'NFSD: client <aaaa> testing state ID with incorrect client ID' was removed by commit 663e36f07666 a couple of years ago, since (as I said above) it is not unexpected for clients to have to test and recover state after a network partition causes the lease to be lost. So that would be the smoking gun for the 'NFS4ERR_EXPIRED' return value from the server.

Comment 11 Trond Myklebust 2023-07-18 00:52:39 UTC
Correction: the 'testing state ID with incorrect client ID' will lead to a NFS4ERR_BAD_STATEID being returned. That has the same effect as the NFS4ERR_EXPIRED.

Comment 12 Frank Ch. Eigler 2023-07-18 00:56:23 UTC
Thanks for your help.  I've added modprobe.d configurations setting nfs nfs4_unique_id=FOO on all the clients, in the hope of avoiding identifier collisions, if that is somehow what's happening at present.
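
Concretely, something like the following drop-in on each client (file name and value are placeholders):

  # /etc/modprobe.d/nfs-unique-id.conf
  options nfs nfs4_unique_id=FOO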

Comment 13 Trond Myklebust 2023-07-18 01:08:21 UTC
If you had unique hostnames on the clients, then the nfs4_unique_id setting is not necessary.

However this report has me wondering about the whole nfsd4_validate_stateid thingy in the server. If the server is testing the stateid for consistency with the new client id before it has even tried to call find_stateid_locked(), then how are we ever going to clear out those stateids? The client is not supposed to call FREE_STATEID if the TEST_STATEID call returns NFS4ERR_BAD_STATEID.

So this might explain the loop: the server is expecting that FREE_STATEID in order to free up the no-longer-valid stateid, but that will never happen because the result from TEST_STATEID told the client that the stateid is bad. This again means the server cannot clear the SEQUENCEID flags, and so we're in for another round of TEST_STATEID. Lather-rinse-repeat...

Comment 14 Benjamin Coddington 2023-08-04 15:28:57 UTC
I think there's a server bug that can cause this, but I've been unable to find a way for the client to tickle it in practice:
https://lore.kernel.org/linux-nfs/c0fe2b35900938943048340ef70fc12282fe1af8.1691160604.git.bcodding@redhat.com/T/#u

Sure would be nice to see a wire capture of the client being able to get into this state.  In a different report of this problem (bug 2217103), the client had 75k delegations and multiple network partitions, but we aren't able to examine the server's state or see what happened on the wire to create this situation.

Comment 15 Benjamin Coddington 2023-08-04 15:33:14 UTC
(In reply to Frank Ch. Eigler from comment #0)
> Once the condition begins, there appears to be no
> stopping it, even with a server reboot.  A client reboot does fix things
> right up, unsurprisingly. :-)

Oh, never mind comment 14.  A server reboot should fix the problem that the patch in comment 14 also fixes.

Comment 16 Frank Ch. Eigler 2024-01-03 16:04:35 UTC
The same or a related problem appears to be plaguing Fedora 39's nfs client.  In a configuration where a RHEL9 nfs-server is serving an F39 nfs client, we're getting into a state where a kernel thread is spinning and NFS traffic is slow.

6.6.4-200.fc39.x86_64


  96.17%  [kernel]                           [k] nfs_server_return_marked_delegations                         ◆
   1.49%  [kernel]                           [k] nfs4_is_valid_delegation.part.0                              ▒
   0.11%  [kernel]                           [k] update_sd_lb_stats.constprop.0                               ▒
   0.09%  [kernel]                           [k] native_write_msr                                             ▒
   0.07%  [kernel]                           [k] cpuidle_enter_state                                          ▒
   0.07%  [kernel]                           [k] nfs_expire_unused_delegation_types                           ▒
   0.05%  [kernel]                           [k] native_irq_return_iret                                       ▒
   0.04%  [kernel]                           [k] menu_select                                                  ▒
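
The per-mount operation counters in /proc/self/mountstats can show whether delegation-related ops are still climbing; rerunning something like this a few seconds apart gives a rough rate:

  # per-op counters for all NFS mounts visible to this process
  grep -E 'DELEGRETURN|TEST_STATEID|GETATTR' /proc/self/mountstats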

Comment 17 Benjamin Coddington 2024-03-04 11:49:47 UTC
This should be fixed by upstream commit
a9b8d90f8726 ("NFSv4: fairly test all delegations on a SEQ4_ revocation"),
which is in the v6.7 kernel.

