Bug 1703850

Summary: [RFE] NFSv4.1 client id trunking in RHEL8 - per share TCP connection with separate RPC queue to same multi-homed NFS server

Product: Red Hat Enterprise Linux 8 | Reporter: Jacob Shivers <jshivers>
Component: kernel | Assignee: Benjamin Coddington <bcodding>
kernel sub component: NFS | QA Contact: JianHong Yin <jiyin>
Status: CLOSED MIGRATED | Docs Contact:
Severity: medium
Priority: medium | CC: dwysocha, fsorenso, jiyin, ossantos, rhandlin, seant, smayhew, steved, swhiteho, tech, xzhou, yieli, yoyang
Version: 8.0 | Keywords: FutureFeature, MigratedToJIRA, Reopened, Reproducer, Triaged
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: | Doc Type: If docs needed, set a value
Doc Text: | Story Points: ---
Clone Of: | Environment:
Last Closed: 2023-09-22 21:10:33 UTC | Type: Story
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: --- | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Description
Jacob Shivers
2019-04-28 19:34:32 UTC
### Current RHEL 8 behavior

** NFS server **

# uname -r
4.18.0-80.el8.x86_64

# ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens3    inet 192.168.122.110/24 brd 192.168.122.255 scope global dynamic noprefixroute ens3\       valid_lft 2702sec preferred_lft 2702sec
3: ens10    inet 192.168.124.45/24 brd 192.168.124.255 scope global dynamic noprefixroute ens10\       valid_lft 2702sec preferred_lft 2702sec

# exportfs -v
/test        192.168.122.0/24(sync,wdelay,hide,no_subtree_check,pnfs,sec=sys,rw,insecure,no_root_squash,no_all_squash)
/test        192.168.124.0/24(sync,wdelay,hide,no_subtree_check,pnfs,sec=sys,rw,insecure,no_root_squash,no_all_squash)
/export      192.168.122.0/24(sync,wdelay,hide,no_subtree_check,pnfs,sec=sys,rw,insecure,no_root_squash,no_all_squash)
/export      192.168.124.0/24(sync,wdelay,hide,no_subtree_check,pnfs,sec=sys,rw,insecure,no_root_squash,no_all_squash)

** NFS client **

# uname -r
4.18.0-80.el8.x86_64

# ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens8    inet 192.168.122.72/24 brd 192.168.122.255 scope global dynamic noprefixroute ens8\       valid_lft 3101sec preferred_lft 3101sec
3: ens3    inet 192.168.124.144/24 brd 192.168.124.255 scope global noprefixroute ens3\       valid_lft forever preferred_lft forever

# tcpdump -s0 -n -i any -w /tmp/clientid_trunking-rhel8.pcap &
# mount 192.168.122.110:/test /mnt/test -o vers=4.1,sec=sys
# mount 192.168.124.45:/export /mnt/export -o vers=4.1,sec=sys

# awk '$3 ~ /nfs4?$/' /proc/mounts
192.168.122.110:/test /mnt/test nfs4 rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.72,local_lock=none,addr=192.168.122.110 0 0
192.168.124.45:/export /mnt/export nfs4 rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.72,local_lock=none,addr=192.168.122.110 0 0

# ss -no '( dport = :2049 )' | cat
Netid State Recv-Q Send-Q Local Address:Port    Peer Address:Port
tcp   ESTAB 0      0      192.168.122.72:999    192.168.122.110:2049    timer:(keepalive,37sec,0)

# pkill tcpdump

o The NFS client is sending the same verifier, as expected.
o The NFS server returns the same clientid, majorid, and scope.
o These values have to be the same in order for clientid trunking to occur.
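The sharing can also be cross-checked from the client without a capture; the two commands below are a suggested check against the same pair of mounts (the per-mount xprt: line in /proc/self/mountstats reports the local port of the RPC transport, so both mounts should show the same local port, 999 above, when the socket is shared). The tshark decode that follows shows the same thing at the protocol level.

# nfsstat -m
# grep 'xprt:' /proc/self/mountstats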
$ tshark -tad -n -r clientid_trunking-rhel8.pcap -Y 'nfs.main_opcode == exchange_id' -T fields -e frame.number -e ip.src -e ip.dst -e rpc.msgtyp -e tcp.stream -e nfs.main_opcode -e nfs.clientid -e nfs.verifier4 -e nfs.majorid4 -e nfs.minorid4 -e nfs.scope -e nfs.status -E header=y 2>/dev/null | tr '\t' '-' | column -t -s '-' frame.number ip.src ip.dst rpc.msgtyp tcp.stream nfs.main_opcode nfs.clientid nfs.verifier4 nfs.majorid4 nfs.minorid4 nfs.scope nfs.status 228 192.168.122.72 192.168.122.110 0 5 42 0x1599b152f07373b4 229 192.168.122.110 192.168.122.72 1 5 42 0x5dd6c55c43625d82 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0,0 231 192.168.122.72 192.168.122.110 0 5 42 0x1599b152f07373b4 232 192.168.122.110 192.168.122.72 1 5 42 0x5dd6c55c44625d82 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0,0 631 192.168.124.144 192.168.124.45 0 6 42 0x1599b152f07373b4 632 192.168.124.45 192.168.124.144 1 6 42 0x5dd6c55c44625d82 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0,0 $ sed 's/../&:/g;s/:$//' <<< 7268656c2d38302d616c7068612e6578616d706c652e636f6d | perl -ne 'printf("%s\n", join("", map {chr hex} split(":", $_)));' rhel-80-alpha.example.com o The NFS client does not create a new session and continues accessing the second share with the existing TCP connection in NFSv4.1, unlike NFSv4.0. ### Desired behavior o The client invokes CREATE SESSION allowing each connection to have its own session instead of tearing down the second TCP connection. There's some work in progress upstream to add an nconnect= mount option that allows load balancing across multiple TCP connections (for any NFS version). But they're all to the same IP address; I don't know if it will help in this case: http://marc.info/?i=155917564898.3988.6096672032831115016.stgit@noble.brown We could also think about how to make the client behave in the 4.1 case more like it does in the 4.0 case. It won't be exactly the same, because in the 4.1 case it has to behave as a single client with a lease that is shared between the two connections. But IO operations at least could be distributed between the two connections. I don't know if there would be any disadvantages. Distributing traffic between interfaces by using different IP addresses for different mounts seems a little limited. It would be nice if we could bring any benefits of session or clientid trunking to single mounts as well. On the server side I suppose we could optionally allow a multihomed server to return different major ids, but there's probably some good reason RFC 5661 explicitly forbids this. The correct way to do this might be just to run multiple knfsd's in different containers on the server, one container for each server IP address. We would like to support that later in RHEL8. I had asked Trond about the patches in his multipath_tcp branch at the Bakeathon a few weeks ago and he indicated that people didn't like the nconnect= mount option and he was planning on re-working it so that it would be a tunable. (In reply to J. Bruce Fields from comment #3) > We could also think about how to make the client behave in the 4.1 case more > like it does in the 4.0 case. It won't be exactly the same, because in the > 4.1 case it has to behave as a single client with a lease that is shared > between the two connections. But IO operations at least could be > distributed between the two connections. 
I don't know if there would be any > disadvantages. An initial disadvantage I could imagine with client_id trunking/shared lease is determining which server IP address a SEQUENCE Call should be sent to. More specifically, if there is a way to cycle between server side IP addresses in the event of a network partition between one of the server side ip addresses and the client side ip address. You could have a situation where the client is only able to communicate with one share at the network level, but SEQUENCEs were sent to the other IP address causing the accessible NFS share to get ERR_EXPIRED due to the lack of lease renewal. Depending on client side recovery mechanisms, you could have a situation where the NFS client can not access any NFS shares with the shared lease because of one server side ip address being inaccessible. It would just require a sufficiently robust client side lease management and recovery system. There has been progress on the nconnect= mount option mentioned in comment 3--see bug 1761352. It should also be possible upstream to run multiple NFS servers in different containers. For that to work in RHEL 8, there's still work to be done--see bug 1365212. I don't know if either of those would address these use cases. We don't currently have a plan to segregate traffic to different connections by mountpoint, and I don't expect that to happen for 8.3. (In reply to J. Bruce Fields from comment #7) > There has been progress on the nconnect= mount option mentioned in comment > 3--see bug 1761352. > > It should also be possible upstream to run multiple NFS servers in different > containers. For that to work in RHEL 8, there's still work to be done--see > bug 1365212. > > I don't know if either of those would address these use cases. > > We don't currently have a plan to segregate traffic to different connections > by mountpoint, and I don't expect that to happen for 8.3. Do you think that this BZ should remain opened or do you think it should be CLOSED as Deferred until there are plans to work on it? It might be useful to understand the customer's use case first. Will nconnect= do the job? If they're just trying to aggregate bandwidth from multiple connections, then that might be sufficient. But perhaps they're trying to seggregate traffic in some other way--e.g. maybe they've got a latency-sensitive application running on one mountpoint, and it's missing deadlines because of bulk transfers on another mount point? (In reply to Jacob Shivers from comment #8) > Do you think that this BZ should remain opened or do you think it should be > CLOSED as Deferred until there are plans to work on it? OK, I guess we should just close. Please feel free to reopen if we get more details about use cases. (In reply to J. Bruce Fields from comment #9) > It might be useful to understand the customer's use case first. Will > nconnect= do the job? If they're just trying to aggregate bandwidth from > multiple connections, then that might be sufficient. But perhaps they're > trying to seggregate traffic in some other way--e.g. maybe they've got a > latency-sensitive application running on one mountpoint, and it's missing > deadlines because of bulk transfers on another mount point? Sorry for the delay in response. Indeed, the customer wants traffic segregation such that the TCP connection for NFS mounts is not multiplexed and they are individualized. 
In the event that the NFS server has multiple IP addresses, the NFS client can connect to a distinct IP address instead of the connections being combined. Thanks. Any details of their use case would be helpful. Possible approaches, elaborating on comment 3 a little: - As requested here, share the same client but use client or session trunking to associate a different tcp connection with each mount. I guess you could control this with a mount option ("noshareconn"?). I'm not sure how to implement it--would you need multiple rpc clients associated with each nfs client, and a way to look up the correct client to use for a given superblock. - Allow the kernel to act as multiple NFS clients, presenting multiple client_owners to the server. Currently the client_owner is global, and any mount to the same server ends up sharing all its protocol state with every other. But we've discussed supporting multiple clients. This would be necessary, for example, to allow NFS mounts from unprivileged containers. So you'd probably do this by mounting from different containers. - Partition on the server side: you can almost do this now with knfsd: create multiple containers in separate network namespaces, one for each IP address. They won't share any state at all, so tcp connections in particular will all be separate. I say "almost" because the functionality is new, and it only works correctly if you don't try to export the same filesystem from two different containers. I'm not sure if that's required here. nconnect might still be useful as well, even if it doesn't isolate workloads on the two mountpoints from each other. (In reply to J. Bruce Fields from comment #17) > Thanks. Any details of their use case would be helpful. > > Possible approaches, elaborating on comment 3 a little: > > - As requested here, share the same client but use client or session > trunking to associate a different tcp connection with each mount. I guess > you could control this with a mount option ("noshareconn"?). I'm not sure > how to implement it--would you need multiple rpc clients associated with > each nfs client, and a way to look up the correct client to use for a given > superblock. > > - Allow the kernel to act as multiple NFS clients, presenting multiple > client_owners to the server. Currently the client_owner is global, and any > mount to the same server ends up sharing all its protocol state with every > other. But we've discussed supporting multiple clients. This would be > necessary, for example, to allow NFS mounts from unprivileged containers. > So you'd probably do this by mounting from different containers. > > - Partition on the server side: you can almost do this now with knfsd: > create multiple containers in separate network namespaces, one for each IP > address. They won't share any state at all, so tcp connections in > particular will all be separate. I say "almost" because the functionality > is new, and it only works correctly if you don't try to export the same > filesystem from two different containers. I'm not sure if that's required > here. > > nconnect might still be useful as well, even if it doesn't isolate workloads > on the two mountpoints from each other. Options two and three certainly see promising as they would not only address the customer's request, but also extend to different container use cases which has its own benefit. Is there anything the customer can do now to assist in testing/developing these features? I've discussed the use of containers to address the customer's request. 
They are completely averse to the use of containers.

In your current estimate, do you see what amounts to option #1 being pursued, or will the client id trunking most likely be implemented via containers, i.e. knfsd or within the client kernel?

(In reply to Jacob Shivers from comment #20)
> They are completely averse to the use of containers.
> In your current estimate, do you see what amounts to option #1 being pursued
> or will the client id trunking most likely be implemented via containers [...]

Of those options, nconnect and server containerization are the only ones getting any work right now.

I'm not aware of anyone actively working on #1 or #2. I do know there's interest in #2, so expect it will happen eventually.

So I'm pessimistic about option #1. I've posted upstream to try to gauge interest and make sure I haven't overlooked some simple solution: https://lore.kernel.org/linux-nfs/20201006151335.GB28306@fieldses.org

nconnect is coming in 8.3: https://bugzilla.redhat.com/show_bug.cgi?id=1761352. All it does is round-robin RPCs across N TCP connections, which isn't what they're looking for, but for their workload it might make it less likely for one user to consume all the bandwidth. In any case, it's easy to try out, so it might be worth it even if it's a long shot.

(In reply to J. Bruce Fields from comment #21)
> Of those options, nconnect and server containerization are the only
> ones getting any work right now. [...]

Understood. I've asked the customer if they have done explicit tests with nconnect to see whether the "freezing" still persists with mounts. I explained that this testing is necessary: if the freezing persists, it tells engineering/upstream that nconnect does not address all use cases of TCP/RPC starvation.

(In reply to J. Bruce Fields from comment #21)
> I'm not aware of anyone actively working on #1 or #2. I do know there's
> interest in #2, so expect it will happen eventually.

Actually, Ben Coddington reminds me that some of #2 is already done: bf11fbdb20b3 "NFS: Add sysfs support for per-container identifier" is in upstream kernel 5.3. (In RHEL8, kernel patches are on the way; I'm not sure about userspace.)
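As a rough sketch of the per-mount network-namespace separation this points toward (explained further in the next comment): the interface name and addresses reuse the ones from the original description, the sysfs identifier path is an assumption and may vary by kernel version, and none of this has been tested here.

# ip netns add nfsdata
(move the client's second NIC into the new namespace and bring it up)
# ip link set ens3 netns nfsdata
# ip netns exec nfsdata ip addr add 192.168.124.144/24 dev ens3
# ip netns exec nfsdata ip link set lo up
# ip netns exec nfsdata ip link set ens3 up
(optionally give the namespace its own client identifier so the server sees a distinct client_owner; path is an assumption)
# ip netns exec nfsdata sh -c 'echo "$(hostname)-data" > /sys/fs/nfs/net/nfs_client/identifier'
(first mount stays in the initial namespace; the second is done from the new one and gets its own nfs_client and TCP connection)
# mount 192.168.122.110:/test /mnt/test -o vers=4.1,sec=sys
# ip netns exec nfsdata mount 192.168.124.45:/export /mnt/export -o vers=4.1,sec=sys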
That allows us to use a separate client identifier for each network namespace. Mounts with different client identifiers will be treated as if they're from entirely separate clients, and the TCP connections will no longer be shared. That would require setting up different network namespaces for the two mounts. That doesn't have to be as heavyweight as two entirely different containers, but it still requires some configuration. Ben points out that you can also change the client identifier by writing to /sys/module/nfs/parameters/nfs4_unique_id before doing the mount. So if you mount the first filesystem, then write to that filesystem, then mount the second filesystem, the two mounts should end up with separate TCP connections. You'd have to be careful not to allow the mounts to be done in parallel. I'm really uneasy about depending on that. It feels like something the sysfs interface wasn't really designed to do, and I'm afraid if it broke some day we wouldn't get much sympathy. Ben's following up on that question upstream: https://lore.kernel.org/r/95542179-0C20-4A1F-A835-77E73AD70DB8@redhat.com (In reply to J. Bruce Fields from comment #23) > So if you > mount the first filesystem, then write to that filesystem (Sorry, I meant "write to that sysfs file"). On further examination I think the code isn't really designed to handle changing /sys/module/nfs/parameters/nfs4_unique_id after you already have mounts (it doesn't appear to handle server reboot recovery correctly, for example). So I don't recommend that approach. (In reply to J. Bruce Fields from comment #25) > (it doesn't appear to handle server reboot recovery correctly, for > example). Hm, no, I was wrong about that. (I'm still wouldn't recommend that approach.) (In reply to J. Bruce Fields from comment #17) > Thanks. Any details of their use case would be helpful. As I understand it, the example given by the customer is essentially as follows: the nfs server exports 2 or more directories: /data1 - contains some large files (for example, 10+ GiB) /data2 - contains many small files (several thousand 4 KiB files) the nfs client mounts the filesystems at /data[12] a process starts large writes to a file in /data1 (cp /large/file /data1) another process starts some operations on files in /data2 (cp /small/files* /data2) the large writes to /data1 result in queueing a large number of nfs WRITE rpc_tasks (with a few synchronous operations on either end of the WRITEs): LOOKUP OPEN SETATTR WRITE WRITE WRITE ... WRITE CLOSE GETATTR the operations to create and write 4 KiB files on /data2 result in entirely synchronous rpc_tasks: LOOKUP OPEN SETATTR WRITE CLOSE GETATTR Because there are already a large number of rpc_tasks queued up by the large writes, each of the rpc_tasks generated by the 'small' process must wait to be serviced, causing the 'small' process to progress extremely slowly. So essentially, the customer would like separate queues for each mount, such that the small, synchronous tasks are NOT sequenced behind a large number of already-existing slow tasks. *** Bug 1838422 has been marked as a duplicate of this bug. *** *** Bug 1838723 has been marked as a duplicate of this bug. 
*** There was an existing "read" test whereby the performance of RHEL8 was compared against RHEL7 and there was a non-trivial performance improvement when using RHEL8: ### RHEL8 # uname -r 4.18.0-193.1.2.el8_2.x86_64 # lsblk /dev/vdc /dev/vdd NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT vdc 252:32 0 1T 0 disk vdd 252:48 0 100G 0 disk # mkfs.xfs /dev/vdc meta-data=/dev/vdc isize=512 agcount=4, agsize=67108864 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=268435456, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=131072, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # mkfs.xfs /dev/vdd meta-data=/dev/vdd isize=512 agcount=4, agsize=6553600 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=26214400, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=12800, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # mkdir /data{1,2} # mount /dev/vdc /data1 # mount /dev/vdd /data2 # df -ht xfs /data* Filesystem Size Used Avail Use% Mounted on /dev/vdc 1.0T 7.2G 1017G 1% /data1 /dev/vdd 100G 746M 100G 1% /data2 # cd /data1 # for i in {1..8}; do fallocate -l 100g test$i; done # ll total 838860800 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test1 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test2 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test3 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test4 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test5 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test6 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test7 -rw-r--r--. 
1 root root 107374182400 Oct 8 15:49 test8 # cd /data2 # for i in {1..10000}; do fallocate -l 4k test$i; done # ls -1 /data2 | wc -l 10000 # exportfs -v /data1 192.168.124.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash) /data2 192.168.124.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash) ### NFS client (RHEL 8.3 beta) # uname -r 4.18.0-221.el8.x86_64 # mkdir /mount{1,2} # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # df -ht nfs4 Filesystem Size Used Avail Use% Mounted on nfs-server-8.example.net:/data1 1.0T 808G 217G 79% /mount1 nfs-server-8.example.net:/data2 100G 790M 100G 1% /mount2 # cd /mount2 # time fgrep abc * real 0m7.768s user 0m0.471s sys 0m0.707s # umount /mount1 /mount2 # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 0m11.052s user 0m0.484s sys 0m0.749s # time fgrep abc * real 0m9.035s user 0m0.456s sys 0m0.674s # time fgrep abc * real 0m8.151s user 0m0.458s sys 0m0.632s # umount /mount1 /mount2 # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 0m9.524s user 0m0.465s sys 0m0.687s # time fgrep abc * real 0m5.456s user 0m0.456s sys 0m0.591s # time fgrep abc * real 0m6.063s user 0m0.456s sys 0m0.554s # umount /mount1 /mount2 # mount nfs-server-8.example.net:/data1 /mount1 -o nconnect=16 ; mount nfs-server-8.example.net:/data2 /mount2 -o nconnect=16 # grep xprt /proc/self/mountstats -c 32 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 0m8.166s user 0m0.460s sys 0m0.673s # time fgrep abc * real 0m4.510s user 0m0.486s sys 0m0.535s # time fgrep abc * real 0m4.070s user 0m0.424s sys 0m0.488s ### RHEL7 # uname -r 3.10.0-1127.10.1.el7.x86_64 # mkdir /mount{1,2} # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # df -ht nfs4 Filesystem Size Used Avail Use% Mounted on nfs-server-8.example.net:/data1 1.0T 808G 217G 79% /mount1 nfs-server-8.example.net:/data2 100G 790M 100G 1% /mount2 # cd /mount2 # time fgrep abc * # time fgrep abc * real 0m5.443s user 0m0.478s sys 0m0.533s # umount /mount{1,2} # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 8m29.302s user 0m0.833s sys 0m2.080s # cd # cd # umount /mount{1,2} # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 8m53.381s user 0m0.747s sys 0m2.170s Expanding on what Frank said, if the customer were able to mount share1 to ipv1 and share2 to ipv2 with NFSv4.1 and later, this would avoid the synchronous tasks being queued by the slower tasks assuming the behavior was distinct based on each share. Changing the title in an attempt to better describe what is asked for, though is a bit verbose and not sure I hit the mark. Also I think this may need a more explicit reproducer or it could get lost in the weeds. Here is information on my reproducer, and some results. 
The environment is an nfs server with two IPs, exporting two directories; one directory has a few very large files, and the other has a lot of very small files. The network bandwidth is limited somewhat. write test: configure server with multiple IPs; in this case, I'm using 192.168.122.61 and 192.168.122.161 server# mkdir /data{1,2} server# echo "/data1 *(rw,no_root_squash,sync)" >> /etc/exports server# echo "/data2 *(rw,no_root_squash,sync)" >> /etc/exports server# exportfs -arv client# mkdir /{mount,data}{1,2} client# for i in {1..8} ; do fallocate -l 5G /data1/large_$i ; done client# for i in {1..10000} ; do fallocate -l 4K /data2/small_$i ; done client# mount server:/data1 /mount1 -overs=4.2,sec=sys client# mount server:/data2 /mount2 -overs=4.2,sec=sys *** this portion should be tested and verified... is it necessary? and if so, are these good values *** on client, limit the network bandwidth to both server IPs somewhat: client# IFACE=bond0 client# IP1=192.168.122.61 client# IP2=192.168.122.161 (add a class based queue, and tell the kernel that for calculations, assume that it is a 1 gbit interface) client# tc qdisc add dev $IFACE root handle 1: cbq avpkt 1000 bandwidth 1gbit (add a 5 Mbit class) client# tc class add dev $IFACE parent 1: classid 1:1 cbq rate 5mbit allot 1500 prio 5 bounded isolated (filter which traffic should be shaped) client# tc filter add dev $IFACE parent 1: protocol ip prio 16 u32 match ip dst $IP1 match ip dport 2049 0xffff flowid 1:1 client# tc filter add dev $IFACE parent 1: protocol ip prio 16 u32 match ip dst $IP2 match ip dport 2049 0xffff flowid 1:1 obtain a baseline for copying the 'small_*' files to the nfs mount client# time bash -c 'cp /data2/* /mount2' obtain a baseline for deleting the 'small_*' files from the nfs mount client# time bash -c 'rm -f /mount2/*' on the client, open a second shell and start a loop copying the 'large_*' files to the nfs mount client# while [[ 42 ]] ; do cp -f /data1/* /mount1 ; done back in the client's first terminal, run the actual 'cp' and 'rm' tests client# time bash -c 'cp /data2/* /mount2' client# time bash -c 'rm -f /mount2/*' *** note: these tests will run very slowly, so it might would probably make sense to actually run under 'timeout' and see the progress when that timeout expires. for example: client# time bash -c 'timeout 30m cp /data2/* /mount2' client# time bash -c 'timeout 10m rm -f /mount2/*' here are some results with the test as described in comment 42 # uname -r 3.10.0-1139.el7.x86_64 baselines: # time cp /data2/* /mount2/ real 6m44.090s user 0m0.725s sys 0m20.734s 10,000 files in 6:44 # time bash -c 'rm -f /mount2/*' real 2m8.430s user 0m0.264s sys 0m9.298s 10,000 files in 2:08 actual test (with the 'cp large_*' loop running) (note: I got impatient, and interrupted the tests prematurely, but you can see the progress) # time cp /data2/* /mount2/ ^C real 27m57.309s user 0m0.070s sys 0m0.097s (how much progress was made in those 28 minutes?) time find /mount2 -mindepth 1 -type f | wc -l 16 real 0m33.999s user 0m0.004s sys 0m0.020s so 16 files were created in about 28 minutes (and 'find' on the directory of 16 entries took 34 seconds) # time bash -c 'rm -f /mount2/*' ^C real 8m22.740s user 0m0.008s sys 0m0.034s (how many files were deleted before I got impatient and interrupted?) 
# time find /mount2 -mindepth 1 -type f | wc -l
2

real 0m29.864s
user 0m0.002s
sys 0m0.023s

so 14 files were deleted in 8 1/2 minutes (and 'find' on the directory of 2 entries took about 30 seconds)

RHEL 7 write test with two server IPs:

baseline:
# uname -r
3.10.0-1139.el7.x86_64
# time bash -c 'cp /data2/* /mount2/'
real 6m8.158s
user 0m0.881s
sys 0m25.488s
$ echo 'scale=3; 10000/(6*60+8)' | bc
27.173 (files copied/second)
# time bash -c 'rm -f /mount2/small*'
real 2m15.574s
user 0m0.287s
sys 0m10.097s
$ echo 'scale=3; 10000/(2*60+15)' | bc
74.074 (files removed/second)

# while [[ 42 ]] ; do cp -f /data1/* /mount1/ ; done

actual test:
# time bash -c 'cp /data2/* /mount2'
real 26m13.612s
user 0m0.967s
sys 0m22.356s
# echo "scale=3 ; 10000/1560" | bc
6.410 (files created/second)
# time find /mount2 -type f | wc -l
10000
real 2m37.300s
user 0m0.165s
sys 0m3.247s
# time bash -c 'rm -f /mount2/small*'
real 10m54.551s
user 0m0.327s
sys 0m8.564s
# echo "scale=3 ; 10000/654" | bc
15.290 (files deleted/second)

RHEL 8 write test with two server IPs and nfs v4.2

baseline:
# uname -r
4.18.0-193.14.3.el8_2.x86_64
# time bash -c 'cp /data2/* /mount2/'
real 5m36.192s
user 0m0.757s
sys 0m9.365s
# echo "scale=3; 10000/(5*60+36.192)" | bc
29.744 (files created/second)
# time bash -c 'rm -f /mount2/*'
real 2m3.034s
user 0m0.219s
sys 0m2.957s
# echo "scale=3; 10000/(2*60+3.034)" | bc
81.278 (files removed/second)

actual test:
# time bash -c 'cp /data2/* /mount2/'
^C
real 73m57.468s
user 0m0.174s
sys 0m1.392s
(interrupted) 971 files were created
# echo "scale=3; 971/(73*60+57.468)" | bc
.218
# time bash -c 'rm -f /mount2/*'
real 11m46.588s
user 0m0.054s
sys 0m0.637s
# echo "scale=3; 10000/(11*60+46.588)" | bc
14.152

RHEL 8 write test with two server IPs and nfs v4.2

baseline:
# uname -r
4.18.0-193.14.3.el8_2.x86_64
# time bash -c 'cp /data2/* /mount2/'
Elapsed time: 331.615211196 - 5:31.615
# echo 'scale=3;10000/331.615211196'|bc
30.155
# time bash -c 'rm -f /mount2/*'
Elapsed time: 124.313464477 - 2:04.313
# echo 'scale=3;10000/124.313464477'|bc
80.441

actual test:
$ time bash -c 'cp /data2/* /mount2/'
^C
real 14m15.001s
user 0m0.604s
sys 0m5.583s
interrupted after 5019 files
# echo 'scale=3 ; 5019 / (14*60+15.001)' | bc
5.870 (files created/second)
created 10,000 files on server, re-ran 'rm' test (with large copies running):
# ./timer bash -c 'rm -f /mount2/*'
Elapsed time: 439.417153663 - 7:19.417
user CPU time: 0.377233 - 0.377
sys CPU time: 3.986195 - 3.986
# echo 'scale=3;10000/439.417153663' | bc
22.757 (files deleted/second)

results summary in files/second:

                              cp baseline   cp result   rm baseline   rm result
RHEL 7.9 - nfs v4.0, 2 IPs:   27.13         6.357       74.074        15.290
RHEL 7.9 - nfs v4.2, 1 IP:    24.752        0.010       78.125        0.027
rhel 8.2 - nfs v4.0, 2 IPs:   30.155        5.870       80.441        22.757
rhel 8.2 - nfs v4.2, 1 IP:    29.744        0.218       81.278        14.152

I also tested using nconnect, and did not see any improvement (I am not finding my results)

similar *read* test performed by Jacob Shivers (BZ1703850 comment 30) with single server IP

read test results from RHEL 8.3 nfs client:
# uname -r
4.18.0-221.el8.x86_64

baseline 'read' test of 10000 'small*' files:
# cd /mount2
# time fgrep abc *
real 0m7.768s
user 0m0.471s
sys 0m0.707s

unmount and remount to clear out cached pages:
# umount /mount1 /mount2
# mount server:/data1 /mount1 ; mount server:/data2 /mount2

the loop reading the 'large*' files:
# cd /mount1
# while :; do cat * > /dev/null; done

unmount and remount with 'nconnect':
# umount /mount1 /mount2
# mount server:/data1 /mount1 -o
nconnect=16 ; mount server:/data2 /mount2 -o nconnect=16 # grep xprt /proc/self/mountstats -c 32 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 0m8.166s user 0m0.460s sys 0m0.673s # time fgrep abc * real 0m4.510s user 0m0.486s sys 0m0.535s # time fgrep abc * real 0m4.070s user 0m0.424s sys 0m0.488s (** note: these 'time' commands do not include the time required to build the list of files passed to 'time grep' command, since that's expanded by the shell prior to running 'time' -- would be better to do: time bash -c 'time fgrep abc */') read test results from RHEL 7 nfs client: # uname -r 3.10.0-1127.10.1.el7.x86_64 # cd /mount2 # time fgrep abc * real 0m5.443s user 0m0.478s sys 0m0.533s # umount /mount{1,2} # mount server:/data1 /mount1 ; mount server:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 8m29.302s user 0m0.833s sys 0m2.080s # umount /mount{1,2} # mount server:/data1 /mount1 ; mount server:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 8m53.381s user 0m0.747s sys 0m2.170s I landed here via Bruce's associated NFS mailing list thread.... I am just wondering if this is the same bottleneck as one I tried to describe in my rather epic "NFS re-export" thread: https://marc.info/?l=linux-nfs&m=160077787901987&w=4 Long story short, when you have already read lots of data into a client's pagecache (or fscache/cachefiles), you can't reuse it later until you do some metadata lookups first to validate. If you are reading or writing to the server at the time, these metadata lookups can take longer than if you hadn't bothered caching that data locally. My assumption was that for a single server mount, the queue of operations just gets stuck chugging through the longer read/write ones with long waits between metadata lookups. I also found that mounting a different server entirely and testing metadata to that was so much better even if the network link was being saturated by the read or writes to the other server. I figure I'll find the same if I use a multi-homed server too and the same export on different mount paths. I had hoped that nconnect would provide some extra parallel metadata performance for independent client processes but it wasn't to be. Regards, Daire Upstream feedback is that we need more evidence of exactly where the performance problem is: https://lore.kernel.org/linux-nfs/e06c31e4211cefda52091c7710d871f44dc9160e.camel@hammerspace.com/ "AFAICS Tom Talpey's question is the relevant one. Why is there a performance regression being seen by these setups when they share the same connection? Is it really the connection, or is it the fact that they all share the same fixed-slot session?" I don't have a suggestion for how to test that easily. I don't know if it's useful or relevant, but I found that the "metadata starvation" problem when another process is doing lots of reads or writes to the same mountpoint, was easier to demonstrate when the client network was congested. 
I could simulate that artificially with an ingress qdisc on the client: # setup the artificial ingress limit modprobe ifb numifbs=1 ip link set dev ifb0 up tc qdisc add dev eth0 ingress tc qdisc add dev ifb0 root handle 1: htb default 10 r2q 4000 tc class add dev ifb0 parent 1: classid :10 htb rate 200mbit tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0 Then mount your NFS server, do lots of reads with one process and walk the remote filesystem with another. Another way to make it even slower is to have multiple threads of simultaneous readers. But even with this low artificial 200mbit limit, I can mount another server and the walk of that filesystem breezes through nice and fast. I imagine it would be the same if it was the same multi-homed server using a different IP. It just seems like you can't have fast bulk IO and fast metadata response from the same mount point at the same time. Daire (In reply to Daire Byrne from comment #51) > I don't know if it's useful or relevant, but I found that the "metadata > starvation" problem when another process is doing lots of reads or writes to > the same mountpoint, was easier to demonstrate when the client network was > congested. It might be useful to add to the upstream thread, but I think it still doesn't answer the question about why this happens. See also https://lore.kernel.org/linux-nfs/20210616011013.50547-1-olga.kornievskaia@gmail.com/T/#t "This patch series attempts to allow for new mounts that are to the same server (ie nfsv4.1+ session trunkable servers) but different network addresses to use connections associated with those mounts but still use the same client structure." Sounds like the same idea that was already rejected, but it'll be interesting to see where this goes. (In reply to J. Bruce Fields from comment #58) > See also > https://lore.kernel.org/linux-nfs/20210616011013.50547-1-olga. > kornievskaia/T/#t > > "This patch series attempts to allow for new mounts that are to the > same server (ie nfsv4.1+ session trunkable servers) but different > network addresses to use connections associated with those mounts > but still use the same client structure." > > Sounds like the same idea that was already rejected, but it'll be > interesting to see where this goes. Did some testing and the patches work as expected, i.e. allowing for NFSv4.1+ to use distinct TCP streams for a given NFS server IP address if specified at mount with the necessary mount option (max_connect). I will note that an existing patch set is required in order to apply the transport changes as noted below. I have not done any additional testing yet, but I will work on that today/tomorrow. 
# git branch -a | grep nfs remotes/nfs_client/ioctl remotes/nfs_client/ioctl-3.10 remotes/nfs_client/knfsd-devel remotes/nfs_client/linux-next remotes/nfs_client/master remotes/nfs_client/multipath_tcp remotes/nfs_client/testing # grep '"nfs_client"' -A2 .git/config [remote "nfs_client"] url = git://git.linux-nfs.org/projects/trondmy/linux-nfs.git fetch = +refs/heads/*:refs/remotes/nfs_client/* # git checkout nfs_client/linux-next Previous HEAD position was e14c779adebe Merge tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux HEAD is now at 009c9aa5be65 Linux 5.13-rc6 # git checkout -b test_no_collapse Switched to a new branch 'test_no_collapse' # for i in {2..14} ; do wget 'https://lore.kernel.org/linux-nfs/20210608195922.88655-'$i'-olga.kornievskaia@gmail.com/raw' -O sunrpc-$(( i - 1 )).patch ; done # for i in {2..7} ; do wget 'https://lore.kernel.org/linux-nfs/20210616011013.50547-'$i'-olga.kornievskaia@gmail.com/raw' -O nfs-$(( i - 1 )).patch ; done # for i in {1..13} ; do git apply sunrpc-$i.patch ; done # for i in {1..6} ; do git apply nfs-$i.patch ; done # make menuconfig # date; time make -j8; time make -j8 modules; date # date; time make -j8 modules_install; time make -j 8 install; date # grubby --set-default=/boot/vmlinuz-5.13.0-rc6+ # systemctl reboot # uname -r 5.13.0-rc6+ # mkdir /mnt/test{1..3} # getent hosts nfs-server-7.example.net 192.168.124.214 ad-nfs-server.example.net nfs-server-7.example.net 192.168.124.213 ad-nfs-server.example.net nfs-server-7.example.net 192.168.124.130 ad-nfs-server.example.net nfs-server-7.example.net # mount 192.168.124.213:/test1 /mnt/test1 -o vers=4.0 # mount 192.168.124.214:/test2 /mnt/test2 -o vers=4.0 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:955 192.168.124.214:2049 timer:(keepalive,3.950ms,0) ino:32477 sk:2a1d5780 ESTAB 0 0 192.168.124.154:947 192.168.124.213:2049 timer:(keepalive,5.280ms,0) ino:32471 sk:2b013301 # mount 192.168.124.130:/test3 /mnt/test3 -o vers=4.0 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:955 192.168.124.214:2049 timer:(keepalive,6.010ms,0) ino:32477 sk:2a1d5780 ESTAB 0 0 192.168.124.154:947 192.168.124.213:2049 timer:(keepalive,7.030ms,0) ino:32471 sk:2b013301 ESTAB 0 0 192.168.124.154:959 192.168.124.130:2049 timer:(keepalive,8.130ms,0) ino:32478 sk:2027ffc3 # umount /mnt/test* # mount 192.168.124.213:/test1 /mnt/test1 -o vers=4.1 # mount 192.168.124.214:/test2 /mnt/test2 -o vers=4.1 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:788 192.168.124.213:2049 timer:(keepalive,7.270ms,0) ino:32491 sk:77bca981 # umount /mnt/test* # mount 192.168.124.213:/test1 /mnt/test1 -o vers=4.1,max_connect=2 # mount 192.168.124.214:/test2 /mnt/test2 -o vers=4.1,max_connect=2 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:957 192.168.124.213:2049 timer:(keepalive,7.920ms,0) ino:32505 sk:c4be294b ESTAB 0 0 192.168.124.154:721 192.168.124.214:2049 timer:(keepalive,7.920ms,0) ino:32511 sk:792e9056 # mount 192.168.124.130:/test3 /mnt/test3 -o vers=4.1 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:957 192.168.124.213:2049 timer:(keepalive,7.750ms,0) ino:32505 sk:c4be294b ESTAB 0 0 192.168.124.154:721 192.168.124.214:2049 
timer:(keepalive,7.750ms,0) ino:32511 sk:792e9056 # journalctl | tail -2 Jun 19 18:53:48 git-box-8.example.net kernel: SUNRPC: reached max allowed number (1) did not add transport to server: 192.168.124.214 Jun 19 18:54:40 git-box-8.example.net kernel: SUNRPC: reached max allowed number (2) did not add transport to server: 192.168.124.130 # mount 192.168.124.130:/test3 /mnt/test3 -o vers=4.1,max_connect=3 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:957 192.168.124.213:2049 timer:(keepalive,7.690ms,0) ino:32505 sk:c4be294b ESTAB 0 0 192.168.124.154:721 192.168.124.214:2049 timer:(keepalive,7.690ms,0) ino:32511 sk:792e9056 # mount nfs-server-7.example.net:/test3 /mnt/test3 -o vers=4.1 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:957 192.168.124.213:2049 timer:(keepalive,4.760ms,0) ino:32505 sk:c4be294b ESTAB 0 0 192.168.124.154:721 192.168.124.214:2049 timer:(keepalive,4.760ms,0) ino:32511 sk:792e905 # mount 192.168.124.214:/test3 /mnt/test3 -o vers=4.1,max_connect=3 # journalctl | tail -2 Jun 19 18:56:00 git-box-8.example.net kernel: SUNRPC: reached max allowed number (2) did not add transport to server: 192.168.124.130 Jun 19 18:56:41 git-box-8.example.net kernel: RPC: addr 192.168.124.214 already in xprt switch I've seen no upstream response to Olga's patches. Another proposal which would also solve this problem, from Neil Brown: https://lore.kernel.org/linux-nfs/162458475606.28671.1835069742861755259@noble.neil.brown.name/ "It is possible to avoid this sharing by creating a separate network namespace for the new connections, but this can often be overly burdensome. This patch introduces the concept of "NFS namespaces" which allows one group of NFS mounts to be completely separate from others without the need for a completely separate network namespace." (In reply to J. Bruce Fields from comment #60) > I've seen no upstream response to Olga's patches. > > Another proposal which would also solve this problem, from Neil Brown: > > https://lore.kernel.org/linux-nfs/162458475606.28671. > 1835069742861755259.brown.name/ > > "It is possible to avoid this sharing by creating a separate network > namespace for the new connections, but this can often be overly > burdensome. This patch introduces the concept of "NFS namespaces" which > allows one group of NFS mounts to be completely separate from others > without the need for a completely separate network namespace." Hello Bruce and Ben, Olga's patches were included in the upstream kernel in v5.14-rc5-36-g7e13420 and can be readily tested on Fedora Rawhide. 
SUNRPC keep track of number of transports to unique addresses SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs NFSv4 introduce max_connect mount options SUNRPC enforce creation of no more than max_connect xprts NFSv4.1 add network transport when session trunking is detected # uname -r 5.15.0-0.rc4.20211008git1da38549dd64.36.fc36.x86_64 # mount 192.168.124.213:/test1 /mnt/test1 -o vers=4.1,max_connect=2 # mount 192.168.124.214:/test2 /mnt/test2 -o vers=4.1,max_connect=2 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess ESTAB 0 0 192.168.124.142:915 192.168.124.213:2049 timer:(keepalive,9.557ms,0) ino:24466 sk:1 cgroup:/ <-> ESTAB 0 0 192.168.124.142:713 192.168.124.214:2049 timer:(keepalive,9.549ms,0) ino:24600 sk:2 cgroup:/ <-> I have asked the customer if they would be willing to run a test kernel to help determine if the patch set is ready for inclusion into RHEL. The patch set provides the feature requirement that is the basis for this RFE/BZ. Is there anything that support delivery or the customer can do to help add these features to RHEL 8? I know that RHEL 8.6 is a release that is a mix of stability and new features. It would seem that this may be the last opportunity to include the feature into RHEL8 unless you think RHEL 8.7 would be an option. More than willing to help and I am sure the customer feels the same in order to get these features into a near-term RHEL release. Thanks Hi Jacob, I don't think Olga's work is going to give the customer what they want, because I think the ability to add a new distinct server endpoint doesn't restrict that specific mount to using only that specific endpoint. Instead, it adds the new endpoint connection to the transport switch. That's my reading of the code -- I haven't tested it. Are you seeing that IO on the mount is restricted to only the specific server endpoint? (In reply to Benjamin Coddington from comment #63) > Hi Jacob, I don't think Olga's work is going to give the customer what they > want, because I think the ability to add a new distinct server endpoint > doesn't restrict that specific mount to using only that specific endpoint. > Instead, it adds the new endpoint connection to the transport switch. > That's my reading of the code -- I haven't tested it. > > Are you seeing that IO on the mount is restricted to only the specific > server endpoint? Your reading was correct and I should have tested again. It basically sends IO in a round-robin fashion. I am going to test disconnecting a server interface to see what recovery looks like, i.e. sending IO to the other IP address. I'll update the BZ once I have completed some additional testing. If an interface is dropped/removed from the NFS server while the NFS client is writing, the NFS client is in a loop of sending duplicate ACKs for the interface that is accessible. While it may be possible to assign the IP address for the removed interface to a remaining interface so long as both interfaces are in the same subnet/vlan, this is not really a solution. This is a nice feature for IO distribution and could see additional benefits when coupled with nconnect, this does not address IO isolation per IP address on NFSv4.1+ I am going to go back and review Neil Brown's patch set to see if there have been any changes. 
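The capture excerpt below illustrates that round-robin behaviour. If needed, the per-address distribution of WRITE calls in such a capture can be summarized with something along these lines (the capture file name is hypothetical; 38 is the NFSv4 WRITE opcode and rpc.msgtyp == 0 selects Calls):

# tshark -n -r max_connect-test.pcap -Y 'nfs.main_opcode == 38 && rpc.msgtyp == 0' -T fields -e ip.dst | sort | uniq -c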
2798 347.605508208 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 8918 Len: 26 2801 347.640395034 192.168.124.213 → 192.168.124.142 NFS 254 V4 Reply (Call In 2798) WRITE 2806 348.641256568 192.168.124.142 → 192.168.124.214 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 8944 Len: 26 2809 348.676883776 192.168.124.214 → 192.168.124.142 NFS 246 V4 Reply (Call In 2806) WRITE 2813 349.678034478 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 8970 Len: 26 2817 349.724954306 192.168.124.213 → 192.168.124.142 NFS 254 V4 Reply (Call In 2813) WRITE 2822 350.725853293 192.168.124.142 → 192.168.124.214 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 8996 Len: 26 2825 350.752136075 192.168.124.214 → 192.168.124.142 NFS 246 V4 Reply (Call In 2822) WRITE 2829 351.752951787 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9022 Len: 26 2833 351.779458628 192.168.124.213 → 192.168.124.142 NFS 254 V4 Reply (Call In 2829) WRITE 2838 352.780406171 192.168.124.142 → 192.168.124.214 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9048 Len: 26 2842 352.860171731 192.168.124.214 → 192.168.124.142 NFS 246 V4 Reply (Call In 2838) WRITE 2846 353.861176709 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9074 Len: 26 2849 353.900166854 192.168.124.213 → 192.168.124.142 NFS 254 V4 Reply (Call In 2846) WRITE 2854 354.901086185 192.168.124.142 → 192.168.124.214 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9100 Len: 26 2858 355.053472681 192.168.124.214 → 192.168.124.142 NFS 246 V4 Reply (Call In 2854) WRITE 2863 356.054473664 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9126 Len: 26 2886 374.949250397 192.168.124.142 → 192.168.124.214 NFS 254 V4 Call SEQUENCE 2888 374.949971516 192.168.124.214 → 192.168.124.142 NFS 218 V4 Reply (Call In 2886) SEQUENCE 3 2.465860428 192.168.124.142 → 192.168.124.214 TCP 66 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073755625 TSecr=2871260084 4 2.466199346 192.168.124.214 → 192.168.124.142 TCP 66 [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871270324 TSecr=3073520106 27 12.705836369 192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#1] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073765865 TSecr=2871270324 28 12.706153366 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#1] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871280564 TSecr=3073520106 43 22.945838678 192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#2] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073776105 TSecr=2871280564 45 22.946278708 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#2] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871290804 TSecr=3073520106 60 33.185866132 192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#3] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073786345 TSecr=2871290804 61 33.186293691 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#3] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871301044 TSecr=3073520106 76 43.425716907 192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#4] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073796585 TSecr=2871301044 77 43.425984444 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#4] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871311284 TSecr=3073520106 94 53.665836603 
192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#5] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073806825 TSecr=2871311284
95 53.666220177 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#5] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871321524 TSecr=3073520106

I went back and applied the test patches from Neil Brown; here are the initial results. Prior to this I had been investigating two different kernel panics, but decided that I should remain focused on the results of the tests.

There are two results for the baseline `# cp`: the initial run, and a second run taken after the `# cp` test with the loop running. I believe the second result is more accurate and more in line with expectations, as the baseline `# cp` should be similar with and without the namespace, given that the other mount is not in use.

+-------------------+------------------------------------------------------+--------------------------+--------------------------+--------------------------+
| Mount type        | Baseline `# cp` (initial || after loop test)         | Baseline `# rm`          | `# cp` with loop         | `# rm` with loop         |
+-------------------+------------------------------------------------------+--------------------------+--------------------------+--------------------------+
| Without namespace | 10K 14.4060 files/second || 10K 8.7600 files/second  | 10K 69.4229 files/second | 1130 0.6277 files/second | 1208 2.0102 files/second |
| With namespace    | 10K 06.8001 files/second || 10K 9.7285 files/second  | 10K 83.1179 files/second | 3501 1.9442 files/second | 3709 6.1792 files/second |
+-------------------+------------------------------------------------------+--------------------------+--------------------------+--------------------------+

The namespace feature does improve performance, though ideally more tests should be conducted in a more controlled environment. All that being said, a more intelligent RPC queue may be more fruitful in addressing the limitations of the current FIFO model, though that is certainly non-trivial.

At this time, I don't know exactly where to proceed. There have been no further comments on the patch proposed by Neil Brown, and he admittedly said that the patches were not designed to address the use case they are being tested for, as cited below:

---------------------------------------8<--------------------------------------
https://lore.kernel.org/linux-nfs/162513954601.3001.5763461156445846045@noble.neil.brown.name/

> > I'm just wondering if this could also help with the problem described
> > in this thread:
> >
> > https://marc.info/?t=160199739400001&r=2&w=4

Not really a good fit for that problem.
---------------------------------------8<--------------------------------------

As noted later by Neil, it is easier said than done to track down where the bottleneck for this issue exists. Any suggestions on what further testing could be done would be well received.

The original thread where this issue was brought upstream has remained silent for some time.
https://lore.kernel.org/linux-nfs/20201006151335.GB28306@fieldses.org/ ### Setup ### NFS Server # mkfs.xfs /dev/vdc meta-data=/dev/vdc isize=512 agcount=4, agsize=13107200 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=52428800, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=25600, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # mount /dev/vdc /exports/ # cd /exports/ # mkdir data{1,2} # chmod 777 * # exportfs -rav exporting *:/exports/data2 exporting *:/exports/data1 # ip -4 -o a 1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever 2: ens3 inet 192.168.124.138/24 brd 192.168.124.255 scope global noprefixroute ens3\ valid_lft forever preferred_lft forever 3: ens13 inet 192.168.124.20/24 brd 192.168.124.255 scope global dynamic noprefixroute ens13\ valid_lft 525sec preferred_lft 525sec ### NFS client # mkfs.xfs /dev/vdd meta-data=/dev/vdd isize=512 agcount=4, agsize=13107200 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=52428800, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=25600, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # mount /dev/vdd /data # mkdir /data/data{1,2} # cd /data/ # chmod 777 * # for i in {1..8} ; do fallocate -l 5G /data/data1/large_$i ; done # for i in {1..10000} ; do fallocate -l 4K /data/data2/small_$i ; done # ip -4 -o a 1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever 2: ens3 inet 192.168.124.154/24 brd 192.168.124.255 scope global noprefixroute ens3\ valid_lft forever preferred_lft forever 3: ens14 inet 192.168.124.25/24 brd 192.168.124.255 scope global dynamic noprefixroute ens14\ valid_lft 3046sec preferred_lft 3046sec # IFACE=ens14 # IP1=192.168.124.20 # IP2=192.168.124.138 # tc qdisc add dev $IFACE root handle 1: cbq avpkt 1000 bandwidth 1gbit # tc filter add dev $IFACE parent 1: protocol ip prio 16 u32 match ip dst $IP1 match ip dport 2049 0xffff flowid 1:1 # tc filter add dev $IFACE parent 1: protocol ip prio 16 u32 match ip dst $IP2 match ip dport 2049 0xffff flowid 1:1 ### Without namespace mount option # mount 192.168.124.20:/exports/data1 /mount1 -overs=4.2,sec=sys -vv mount.nfs: timeout set for Tue Nov 23 10:21:55 2021 mount.nfs: trying text-based options 'vers=4.2,sec=sys,addr=192.168.124.20,clientaddr=192.168.124.25' 192.168.124.20:/exports/data1 on /mount1 type nfs (rw,vers=4.2,sec=sys) # mount 192.168.124.20:/exports/data2 /mount2 -overs=4.2,sec=sys -vv mount.nfs: timeout set for Tue Nov 23 10:22:11 2021 mount.nfs: trying text-based options 'vers=4.2,sec=sys,addr=192.168.124.20,clientaddr=192.168.124.25' 192.168.124.20:/exports/data2 on /mount2 type nfs (rw,vers=4.2,sec=sys) # grep nfs4 /proc/self/mounts 192.168.124.20:/exports/data1 /mount1 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0 192.168.124.20:/exports/data2 /mount2 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer 
### Without namespace mount option

# mount 192.168.124.20:/exports/data1 /mount1 -overs=4.2,sec=sys -vv
mount.nfs: timeout set for Tue Nov 23 10:21:55 2021
mount.nfs: trying text-based options 'vers=4.2,sec=sys,addr=192.168.124.20,clientaddr=192.168.124.25'
192.168.124.20:/exports/data1 on /mount1 type nfs (rw,vers=4.2,sec=sys)

# mount 192.168.124.20:/exports/data2 /mount2 -overs=4.2,sec=sys -vv
mount.nfs: timeout set for Tue Nov 23 10:22:11 2021
mount.nfs: trying text-based options 'vers=4.2,sec=sys,addr=192.168.124.20,clientaddr=192.168.124.25'
192.168.124.20:/exports/data2 on /mount2 type nfs (rw,vers=4.2,sec=sys)

# grep nfs4 /proc/self/mounts
192.168.124.20:/exports/data1 /mount1 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0
192.168.124.20:/exports/data2 /mount2 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0

# ss -ptone '( dport = :2049 )' | cat
State  Recv-Q  Send-Q  Local Address:Port      Peer Address:Port    Process
ESTAB  0       0       192.168.124.25:884      192.168.124.20:2049  timer:(keepalive,8.530ms,0) ino:32906 sk:9fbeb9ae

* Baseline `# cp`

# time bash -c 'timeout 30m /usr/bin/cp -f /data/data2/* /mount2'

real    11m34.223s
user    0m0.388s
sys     0m4.313s

* Files created per second

$ echo 'scale=4; 10001 / ((11 * 60) + 34.223)' | bc
14.4060

* Baseline `# rm`

# time bash -c 'timeout 10m /usr/bin/rm -f /mount2/*'

real    2m24.059s
user    0m0.147s
sys     0m1.279s

* Files removed per second

$ echo 'scale=4; 10001 / ((2 * 60) + 24.059)' | bc
69.4229

# while [[ 42 ]] ; do /usr/bin/cp -f /data/data1/* /mount1 ; done

# time bash -c 'timeout 30m /usr/bin/cp -f /data/data2/* /mount2'

real    30m0.055s
user    0m0.113s
sys     0m1.839s

# ll /mount2 | wc -l
1130

* Files created per second

$ echo 'scale=4; 1130 / ((30 * 60) + 0.055)' | bc
.6277

*** break loop ***

# time bash -c 'timeout 30m /usr/bin/cp -f /data/data2/* /mount2'

real    19m11.659s
user    0m0.353s
sys     0m4.607s

# time bash -c 'timeout 10m /usr/bin/rm -f /mount2/*'

real    10m0.933s
user    0m0.076s
sys     0m1.489s

# ll /mount2 | wc -l
8793

* Files removed

$ echo $(( 10001 - 8793 ))
1208

* Files removed per second

$ echo 'scale=4; 1208 / ((10 * 60) + 0.933)' | bc
2.0102

### With namespace mount option

# mount 192.168.124.20:/exports/data1 /mount1 -overs=4.2,sec=sys,namespace=data1,clientaddr=192.168.124.25 -vv
mount.nfs: timeout set for Mon Nov 22 14:55:44 2021
mount.nfs: trying text-based options 'vers=4.2,sec=sys,namespace=data1,clientaddr=192.168.124.25,addr=192.168.124.20'
192.168.124.20:/exports/data1 on /mount1 type nfs (rw,vers=4.2,sec=sys,namespace=data1,clientaddr=192.168.124.25)

# mount 192.168.124.138:/exports/data2 /mount2 -overs=4.2,sec=sys,namespace=data2,clientaddr=192.168.124.154 -vv
mount.nfs: timeout set for Mon Nov 22 14:56:22 2021
mount.nfs: trying text-based options 'vers=4.2,sec=sys,namespace=data2,clientaddr=192.168.124.154,addr=192.168.124.138'
192.168.124.138:/exports/data2 on /mount2 type nfs (rw,vers=4.2,sec=sys,namespace=data2,clientaddr=192.168.124.154)

# grep nfs4 /proc/self/mounts
192.168.124.20:/exports/data1 /mount1 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,namespace=data1,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0
192.168.124.138:/exports/data2 /mount2 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,namespace=data2,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.154,local_lock=none,addr=192.168.124.138 0 0

# ss -ptone '( dport = :2049 )' | cat
State  Recv-Q  Send-Q  Local Address:Port      Peer Address:Port     Process
ESTAB  0       0       192.168.124.25:767      192.168.124.138:2049  timer:(keepalive,3.530ms,0) ino:32330 sk:6f6cdf90
ESTAB  0       0       192.168.124.25:1020     192.168.124.20:2049   timer:(keepalive,7.630ms,0) ino:32327 sk:949111d8
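One additional check worth capturing at this point, suggested here as a sketch rather than something taken from the run above, is to confirm from the server side that the two mounts really did register as two separate NFSv4 client identities and not merely two TCP connections. This assumes the server kernel is new enough to expose the nfsd client tracking directory under /proc/fs/nfsd/clients, which the 4.18-based kernel used here may not be:

# for c in /proc/fs/nfsd/clients/*/info ; do echo "== $c" ; cat "$c" ; done

Two entries with distinct clientid values, one per namespace, would confirm that the namespace option is producing separate client identities rather than trunking both mounts over a single clientid.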
# time bash -c 'timeout 30m cp /data/data2/* /mount2'

real    24m30.697s
user    0m0.328s
sys     0m3.998s

* Files created per second

$ echo 'scale=4; 10001 / ((24 * 60) + 30.697)' | bc
6.8001

# time bash -c 'timeout 10m rm -f /mount2/*'

real    2m0.323s
user    0m0.110s
sys     0m1.490s

* Files removed per second

$ echo 'scale=4; 10001 / ((2 * 60) + 0.323)' | bc
83.1179

# while [[ 42 ]] ; do /usr/bin/cp -f /data/data1/* /mount1 ; done

# time bash -c 'timeout 30m cp /data/data2/* /mount2'

real    30m0.656s
user    0m0.289s
sys     0m2.178s

# ll /exports/data2/* | wc -l
3501

* Files created per second

$ echo 'scale=4; 3501 / ((30 * 60) + 0.656)' | bc
1.9442

*** stop the loop ***

# time bash -c 'timeout 30m /usr/bin/cp -f /data/data2/* /mount2'

real    17m18.001s
user    0m0.376s
sys     0m3.526s

* Files created per second

$ echo 'scale=4; 10001 / ((17 * 60) + 18.001)' | bc
9.6348

# time bash -c 'timeout 10m rm -f /mount2/*'

real    10m0.231s
user    0m0.116s
sys     0m0.967s

# ll /exports/data2/* | wc -l
6292

* 3709 files deleted in 10 minutes

$ echo $(( 10001 - 6292 ))
3709

* Files deleted per second

$ echo 'scale=4; 3709 / ((10 * 60) + 0.231)' | bc
6.1792

I just thought I'd add that when I also tested Neil Brown's patches, I only saw minimal improvement in my testing.

My test was similar except I was simply reading files to /dev/null in one namespace (maxing the network) and doing an "ls -lR" in another namespace mount.

In your use case, you would be happy with a separate mount that could be used for fast metadata intensive workloads (scanning a filesystem), but in my case I would prefer that there was a means to have a single mount that could "prefer" or prioritise the smaller metadata lookups. I'm not sure how "ls -l" and readdirplus lookups would fit into such a priority scheme.

Ultimately, we want to be able to cache huge datasets on a client (either in pagecache or fscache), but before we can use that cached data (accumulated over days), we need to validate the cache and see if the file changed. And it is the slowness of these lookups on a busy read/write client that hurts our caching performance.

I would love to help identify the bottleneck or test any solutions, but I really don't know where to start.

Daire

(In reply to Daire Byrne from comment #67)
> I just thought I'd add that when I also tested Neil Brown's patches, I only
> saw minimal improvement in my testing.
>
> My test was similar except I was simply reading files to /dev/null in one
> namespace (maxing the network) and doing an "ls -lR" in another namespace
> mount.
>
> In your use case, you would be happy with a separate mount that could be
> used for fast metadata intensive workloads (scanning a filesystem), but in
> my case I would prefer that there was a means to have a single mount that
> could "prefer" or prioritise the smaller metadata lookups. I'm not sure how
> "ls -l" and readdirplus lookups would fit into such a priority scheme.
>
> Ultimately, we want to be able to cache huge datasets on a client (either in
> pagecache or fscache), but before we can use that cached data (accumulated
> over days), we need to validate the cache and see if the file changed. And
> it is the slowness of these lookups on a busy read/write client that hurts
> our caching performance.
>
> I would love to help identify the bottleneck or test any solutions, but I
> really don't know where to start.
>
> Daire

Hello Daire,

Thank you for reaching out. I had previously considered emailing you to see what kind of results you had observed in your own testing :)

My primary daily focus has shifted, so I generally spend less time on NFS issues. That being said, even if I had more time to investigate, moving this issue forward still seems largely non-trivial. I intend to go back and review the responses from the email thread that started this conversation.

Could you say whether you have done any further testing or have any additional observations at this time?

Thanks,

Jacob
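As a possible starting point for further testing, and for quantifying the metadata slowdown described above, one rough approach would be to compare the per-operation latencies the client already tracks for the metadata-heavy mount while the bulk I/O is idle versus while it is running. This is only a sketch using the mountstats tool from nfs-utils, not something from the runs above, and the grep pattern assumes the operation names start a line in the default output:

# mountstats /mount1 | grep -E -A3 '^(GETATTR|LOOKUP|ACCESS|READDIRPLUS):'
  ... start the large-file loop on the other mount, wait a few minutes, then repeat ...
# mountstats /mount1 | grep -E -A3 '^(GETATTR|LOOKUP|ACCESS|READDIRPLUS):'

If the average execute time for these operations grows much more than the RTT while the bulk transfer is running, the extra time is being spent queued on the client side rather than at the server, which is where a smarter RPC queue or the namespace separation would be expected to help. Because the numbers are cumulative since mount time, remounting before each sample, or using interval output from nfsiostat, gives a cleaner comparison.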
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.