Bug 1703850

Summary: [RFE] NFSv4.1 client id trunking in RHEL8 - per share TCP connection with separate RPC queue to same multi-homed NFS server

Product: Red Hat Enterprise Linux 8 | Reporter: Jacob Shivers <jshivers>
Component: kernel | Assignee: Benjamin Coddington <bcodding>
kernel sub component: NFS | QA Contact: JianHong Yin <jiyin>
Status: CLOSED MIGRATED | Docs Contact:
Severity: medium
Priority: medium | CC: dwysocha, fsorenso, jiyin, ossantos, rhandlin, seant, smayhew, steved, swhiteho, tech, xzhou, yieli, yoyang
Version: 8.0 | Keywords: FutureFeature, MigratedToJIRA, Reopened, Reproducer, Triaged
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: | Doc Type: If docs needed, set a value
Doc Text: | Story Points: ---
Clone Of: | Environment:
Last Closed: 2023-09-22 21:10:33 UTC | Type: Story
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: --- | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Description
Jacob Shivers
2019-04-28 19:34:32 UTC
### Current RHEL 8 behavior

** NFS server **

# uname -r
4.18.0-80.el8.x86_64

# ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens3    inet 192.168.122.110/24 brd 192.168.122.255 scope global dynamic noprefixroute ens3\       valid_lft 2702sec preferred_lft 2702sec
3: ens10    inet 192.168.124.45/24 brd 192.168.124.255 scope global dynamic noprefixroute ens10\       valid_lft 2702sec preferred_lft 2702sec

# exportfs -v
/test        192.168.122.0/24(sync,wdelay,hide,no_subtree_check,pnfs,sec=sys,rw,insecure,no_root_squash,no_all_squash)
/test        192.168.124.0/24(sync,wdelay,hide,no_subtree_check,pnfs,sec=sys,rw,insecure,no_root_squash,no_all_squash)
/export      192.168.122.0/24(sync,wdelay,hide,no_subtree_check,pnfs,sec=sys,rw,insecure,no_root_squash,no_all_squash)
/export      192.168.124.0/24(sync,wdelay,hide,no_subtree_check,pnfs,sec=sys,rw,insecure,no_root_squash,no_all_squash)

** NFS client **

# uname -r
4.18.0-80.el8.x86_64

# ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens8    inet 192.168.122.72/24 brd 192.168.122.255 scope global dynamic noprefixroute ens8\       valid_lft 3101sec preferred_lft 3101sec
3: ens3    inet 192.168.124.144/24 brd 192.168.124.255 scope global noprefixroute ens3\       valid_lft forever preferred_lft forever

# tcpdump -s0 -n -i any -w /tmp/clientid_trunking-rhel8.pcap &
# mount 192.168.122.110:/test /mnt/test -o vers=4.1,sec=sys
# mount 192.168.124.45:/export /mnt/export -o vers=4.1,sec=sys

# awk '$3 ~ /nfs4?$/' /proc/mounts
192.168.122.110:/test /mnt/test nfs4 rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.72,local_lock=none,addr=192.168.122.110 0 0
192.168.124.45:/export /mnt/export nfs4 rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.72,local_lock=none,addr=192.168.122.110 0 0

# ss -no '( dport = :2049 )' | cat
Netid State Recv-Q Send-Q Local Address:Port    Peer Address:Port
tcp   ESTAB 0      0      192.168.122.72:999    192.168.122.110:2049    timer:(keepalive,37sec,0)

# pkill tcpdump

o The NFS client is sending the same verifier, as expected.
o The NFS server returns the same clientid, majorid, and scope.
o These values have to be the same in order for clientid trunking to occur.
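The sharing can also be cross-checked from the client without a capture; the two commands below are a suggested check against the same pair of mounts (the per-mount xprt: line in /proc/self/mountstats reports the local port of the RPC transport, so both mounts should show the same local port, 999 above, when the socket is shared). The tshark decode that follows shows the same thing at the protocol level.

# nfsstat -m
# grep 'xprt:' /proc/self/mountstats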
$ tshark -tad -n -r clientid_trunking-rhel8.pcap -Y 'nfs.main_opcode == exchange_id' -T fields -e frame.number -e ip.src -e ip.dst -e rpc.msgtyp -e tcp.stream -e nfs.main_opcode -e nfs.clientid -e nfs.verifier4 -e nfs.majorid4 -e nfs.minorid4 -e nfs.scope -e nfs.status -E header=y 2>/dev/null | tr '\t' '-' | column -t -s '-' frame.number ip.src ip.dst rpc.msgtyp tcp.stream nfs.main_opcode nfs.clientid nfs.verifier4 nfs.majorid4 nfs.minorid4 nfs.scope nfs.status 228 192.168.122.72 192.168.122.110 0 5 42 0x1599b152f07373b4 229 192.168.122.110 192.168.122.72 1 5 42 0x5dd6c55c43625d82 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0,0 231 192.168.122.72 192.168.122.110 0 5 42 0x1599b152f07373b4 232 192.168.122.110 192.168.122.72 1 5 42 0x5dd6c55c44625d82 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0,0 631 192.168.124.144 192.168.124.45 0 6 42 0x1599b152f07373b4 632 192.168.124.45 192.168.124.144 1 6 42 0x5dd6c55c44625d82 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0 7268656c2d38302d616c7068612e6578616d706c652e636f6d 0,0 $ sed 's/../&:/g;s/:$//' <<< 7268656c2d38302d616c7068612e6578616d706c652e636f6d | perl -ne 'printf("%s\n", join("", map {chr hex} split(":", $_)));' rhel-80-alpha.example.com o The NFS client does not create a new session and continues accessing the second share with the existing TCP connection in NFSv4.1, unlike NFSv4.0. ### Desired behavior o The client invokes CREATE SESSION allowing each connection to have its own session instead of tearing down the second TCP connection. There's some work in progress upstream to add an nconnect= mount option that allows load balancing across multiple TCP connections (for any NFS version). But they're all to the same IP address; I don't know if it will help in this case: http://marc.info/?i=155917564898.3988.6096672032831115016.stgit@noble.brown We could also think about how to make the client behave in the 4.1 case more like it does in the 4.0 case. It won't be exactly the same, because in the 4.1 case it has to behave as a single client with a lease that is shared between the two connections. But IO operations at least could be distributed between the two connections. I don't know if there would be any disadvantages. Distributing traffic between interfaces by using different IP addresses for different mounts seems a little limited. It would be nice if we could bring any benefits of session or clientid trunking to single mounts as well. On the server side I suppose we could optionally allow a multihomed server to return different major ids, but there's probably some good reason RFC 5661 explicitly forbids this. The correct way to do this might be just to run multiple knfsd's in different containers on the server, one container for each server IP address. We would like to support that later in RHEL8. I had asked Trond about the patches in his multipath_tcp branch at the Bakeathon a few weeks ago and he indicated that people didn't like the nconnect= mount option and he was planning on re-working it so that it would be a tunable. (In reply to J. Bruce Fields from comment #3) > We could also think about how to make the client behave in the 4.1 case more > like it does in the 4.0 case. It won't be exactly the same, because in the > 4.1 case it has to behave as a single client with a lease that is shared > between the two connections. But IO operations at least could be > distributed between the two connections. 
I don't know if there would be any > disadvantages. An initial disadvantage I could imagine with client_id trunking/shared lease is determining which server IP address a SEQUENCE Call should be sent to. More specifically, if there is a way to cycle between server side IP addresses in the event of a network partition between one of the server side ip addresses and the client side ip address. You could have a situation where the client is only able to communicate with one share at the network level, but SEQUENCEs were sent to the other IP address causing the accessible NFS share to get ERR_EXPIRED due to the lack of lease renewal. Depending on client side recovery mechanisms, you could have a situation where the NFS client can not access any NFS shares with the shared lease because of one server side ip address being inaccessible. It would just require a sufficiently robust client side lease management and recovery system. There has been progress on the nconnect= mount option mentioned in comment 3--see bug 1761352. It should also be possible upstream to run multiple NFS servers in different containers. For that to work in RHEL 8, there's still work to be done--see bug 1365212. I don't know if either of those would address these use cases. We don't currently have a plan to segregate traffic to different connections by mountpoint, and I don't expect that to happen for 8.3. (In reply to J. Bruce Fields from comment #7) > There has been progress on the nconnect= mount option mentioned in comment > 3--see bug 1761352. > > It should also be possible upstream to run multiple NFS servers in different > containers. For that to work in RHEL 8, there's still work to be done--see > bug 1365212. > > I don't know if either of those would address these use cases. > > We don't currently have a plan to segregate traffic to different connections > by mountpoint, and I don't expect that to happen for 8.3. Do you think that this BZ should remain opened or do you think it should be CLOSED as Deferred until there are plans to work on it? It might be useful to understand the customer's use case first. Will nconnect= do the job? If they're just trying to aggregate bandwidth from multiple connections, then that might be sufficient. But perhaps they're trying to seggregate traffic in some other way--e.g. maybe they've got a latency-sensitive application running on one mountpoint, and it's missing deadlines because of bulk transfers on another mount point? (In reply to Jacob Shivers from comment #8) > Do you think that this BZ should remain opened or do you think it should be > CLOSED as Deferred until there are plans to work on it? OK, I guess we should just close. Please feel free to reopen if we get more details about use cases. (In reply to J. Bruce Fields from comment #9) > It might be useful to understand the customer's use case first. Will > nconnect= do the job? If they're just trying to aggregate bandwidth from > multiple connections, then that might be sufficient. But perhaps they're > trying to seggregate traffic in some other way--e.g. maybe they've got a > latency-sensitive application running on one mountpoint, and it's missing > deadlines because of bulk transfers on another mount point? Sorry for the delay in response. Indeed, the customer wants traffic segregation such that the TCP connection for NFS mounts is not multiplexed and they are individualized. 
In the event that the NFS server has multiple IP addresses, the NFS client can connect to a distinct IP address instead of the connections being combined. Thanks. Any details of their use case would be helpful. Possible approaches, elaborating on comment 3 a little: - As requested here, share the same client but use client or session trunking to associate a different tcp connection with each mount. I guess you could control this with a mount option ("noshareconn"?). I'm not sure how to implement it--would you need multiple rpc clients associated with each nfs client, and a way to look up the correct client to use for a given superblock. - Allow the kernel to act as multiple NFS clients, presenting multiple client_owners to the server. Currently the client_owner is global, and any mount to the same server ends up sharing all its protocol state with every other. But we've discussed supporting multiple clients. This would be necessary, for example, to allow NFS mounts from unprivileged containers. So you'd probably do this by mounting from different containers. - Partition on the server side: you can almost do this now with knfsd: create multiple containers in separate network namespaces, one for each IP address. They won't share any state at all, so tcp connections in particular will all be separate. I say "almost" because the functionality is new, and it only works correctly if you don't try to export the same filesystem from two different containers. I'm not sure if that's required here. nconnect might still be useful as well, even if it doesn't isolate workloads on the two mountpoints from each other. (In reply to J. Bruce Fields from comment #17) > Thanks. Any details of their use case would be helpful. > > Possible approaches, elaborating on comment 3 a little: > > - As requested here, share the same client but use client or session > trunking to associate a different tcp connection with each mount. I guess > you could control this with a mount option ("noshareconn"?). I'm not sure > how to implement it--would you need multiple rpc clients associated with > each nfs client, and a way to look up the correct client to use for a given > superblock. > > - Allow the kernel to act as multiple NFS clients, presenting multiple > client_owners to the server. Currently the client_owner is global, and any > mount to the same server ends up sharing all its protocol state with every > other. But we've discussed supporting multiple clients. This would be > necessary, for example, to allow NFS mounts from unprivileged containers. > So you'd probably do this by mounting from different containers. > > - Partition on the server side: you can almost do this now with knfsd: > create multiple containers in separate network namespaces, one for each IP > address. They won't share any state at all, so tcp connections in > particular will all be separate. I say "almost" because the functionality > is new, and it only works correctly if you don't try to export the same > filesystem from two different containers. I'm not sure if that's required > here. > > nconnect might still be useful as well, even if it doesn't isolate workloads > on the two mountpoints from each other. Options two and three certainly see promising as they would not only address the customer's request, but also extend to different container use cases which has its own benefit. Is there anything the customer can do now to assist in testing/developing these features? I've discussed the use of containers to address the customer's request. 
They are completely averse to the use of containers.

In your current estimate, do you see what amounts to option #1 being pursued, or will the client id trunking most likely be implemented via containers, i.e. knfsd or within the client kernel?

(In reply to Jacob Shivers from comment #20)
> They are completely averse to the use of containers.
> In your current estimate, do you see what amounts to option #1 being pursued
> or will the client id trunking most likely be implemented via containers [...]

Of those options, nconnect and server containerization are the only ones getting any work right now.

I'm not aware of anyone actively working on #1 or #2. I do know there's interest in #2, so expect it will happen eventually.

So I'm pessimistic about option #1. I've posted upstream to try to gauge interest and make sure I haven't overlooked some simple solution: https://lore.kernel.org/linux-nfs/20201006151335.GB28306@fieldses.org

nconnect is coming in 8.3: https://bugzilla.redhat.com/show_bug.cgi?id=1761352. All it does is round-robin RPCs across N TCP connections, which isn't what they're looking for, but for their workload it might make it less likely for one user to consume all the bandwidth. In any case, it's easy to try out, so it might be worth it even if it's a long shot.

(In reply to J. Bruce Fields from comment #21)
> Of those options, nconnect and server containerization are the only
> ones getting any work right now. [...]

Understood. I've asked the customer if they have done explicit tests with nconnect to see whether the "freezing" still persists with mounts. I explained that this testing is necessary: if the freezing persists, it tells engineering/upstream that nconnect does not address all use cases of TCP/RPC starvation.

(In reply to J. Bruce Fields from comment #21)
> I'm not aware of anyone actively working on #1 or #2. I do know there's
> interest in #2, so expect it will happen eventually.

Actually, Ben Coddington reminds me that some of #2 is already done: bf11fbdb20b3 "NFS: Add sysfs support for per-container identifier" is in upstream kernel 5.3. (In RHEL8, kernel patches are on the way; I'm not sure about userspace.)
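As a rough sketch of the per-mount network-namespace separation this points toward (explained further in the next comment): the interface name and addresses reuse the ones from the original description, the sysfs identifier path is an assumption and may vary by kernel version, and none of this has been tested here.

# ip netns add nfsdata
(move the client's second NIC into the new namespace and bring it up)
# ip link set ens3 netns nfsdata
# ip netns exec nfsdata ip addr add 192.168.124.144/24 dev ens3
# ip netns exec nfsdata ip link set lo up
# ip netns exec nfsdata ip link set ens3 up
(optionally give the namespace its own client identifier so the server sees a distinct client_owner; path is an assumption)
# ip netns exec nfsdata sh -c 'echo "$(hostname)-data" > /sys/fs/nfs/net/nfs_client/identifier'
(first mount stays in the initial namespace; the second is done from the new one and gets its own nfs_client and TCP connection)
# mount 192.168.122.110:/test /mnt/test -o vers=4.1,sec=sys
# ip netns exec nfsdata mount 192.168.124.45:/export /mnt/export -o vers=4.1,sec=sys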
That allows us to use a separate client identifier for each network namespace. Mounts with different client identifiers will be treated as if they're from entirely separate clients, and the TCP connections will no longer be shared. That would require setting up different network namespaces for the two mounts. That doesn't have to be as heavyweight as two entirely different containers, but it still requires some configuration. Ben points out that you can also change the client identifier by writing to /sys/module/nfs/parameters/nfs4_unique_id before doing the mount. So if you mount the first filesystem, then write to that filesystem, then mount the second filesystem, the two mounts should end up with separate TCP connections. You'd have to be careful not to allow the mounts to be done in parallel. I'm really uneasy about depending on that. It feels like something the sysfs interface wasn't really designed to do, and I'm afraid if it broke some day we wouldn't get much sympathy. Ben's following up on that question upstream: https://lore.kernel.org/r/95542179-0C20-4A1F-A835-77E73AD70DB8@redhat.com (In reply to J. Bruce Fields from comment #23) > So if you > mount the first filesystem, then write to that filesystem (Sorry, I meant "write to that sysfs file"). On further examination I think the code isn't really designed to handle changing /sys/module/nfs/parameters/nfs4_unique_id after you already have mounts (it doesn't appear to handle server reboot recovery correctly, for example). So I don't recommend that approach. (In reply to J. Bruce Fields from comment #25) > (it doesn't appear to handle server reboot recovery correctly, for > example). Hm, no, I was wrong about that. (I'm still wouldn't recommend that approach.) (In reply to J. Bruce Fields from comment #17) > Thanks. Any details of their use case would be helpful. As I understand it, the example given by the customer is essentially as follows: the nfs server exports 2 or more directories: /data1 - contains some large files (for example, 10+ GiB) /data2 - contains many small files (several thousand 4 KiB files) the nfs client mounts the filesystems at /data[12] a process starts large writes to a file in /data1 (cp /large/file /data1) another process starts some operations on files in /data2 (cp /small/files* /data2) the large writes to /data1 result in queueing a large number of nfs WRITE rpc_tasks (with a few synchronous operations on either end of the WRITEs): LOOKUP OPEN SETATTR WRITE WRITE WRITE ... WRITE CLOSE GETATTR the operations to create and write 4 KiB files on /data2 result in entirely synchronous rpc_tasks: LOOKUP OPEN SETATTR WRITE CLOSE GETATTR Because there are already a large number of rpc_tasks queued up by the large writes, each of the rpc_tasks generated by the 'small' process must wait to be serviced, causing the 'small' process to progress extremely slowly. So essentially, the customer would like separate queues for each mount, such that the small, synchronous tasks are NOT sequenced behind a large number of already-existing slow tasks. *** Bug 1838422 has been marked as a duplicate of this bug. *** *** Bug 1838723 has been marked as a duplicate of this bug. 
*** There was an existing "read" test whereby the performance of RHEL8 was compared against RHEL7 and there was a non-trivial performance improvement when using RHEL8: ### RHEL8 # uname -r 4.18.0-193.1.2.el8_2.x86_64 # lsblk /dev/vdc /dev/vdd NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT vdc 252:32 0 1T 0 disk vdd 252:48 0 100G 0 disk # mkfs.xfs /dev/vdc meta-data=/dev/vdc isize=512 agcount=4, agsize=67108864 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=268435456, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=131072, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # mkfs.xfs /dev/vdd meta-data=/dev/vdd isize=512 agcount=4, agsize=6553600 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=26214400, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=12800, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # mkdir /data{1,2} # mount /dev/vdc /data1 # mount /dev/vdd /data2 # df -ht xfs /data* Filesystem Size Used Avail Use% Mounted on /dev/vdc 1.0T 7.2G 1017G 1% /data1 /dev/vdd 100G 746M 100G 1% /data2 # cd /data1 # for i in {1..8}; do fallocate -l 100g test$i; done # ll total 838860800 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test1 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test2 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test3 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test4 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test5 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test6 -rw-r--r--. 1 root root 107374182400 Oct 8 15:49 test7 -rw-r--r--. 
1 root root 107374182400 Oct 8 15:49 test8 # cd /data2 # for i in {1..10000}; do fallocate -l 4k test$i; done # ls -1 /data2 | wc -l 10000 # exportfs -v /data1 192.168.124.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash) /data2 192.168.124.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash) ### NFS client (RHEL 8.3 beta) # uname -r 4.18.0-221.el8.x86_64 # mkdir /mount{1,2} # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # df -ht nfs4 Filesystem Size Used Avail Use% Mounted on nfs-server-8.example.net:/data1 1.0T 808G 217G 79% /mount1 nfs-server-8.example.net:/data2 100G 790M 100G 1% /mount2 # cd /mount2 # time fgrep abc * real 0m7.768s user 0m0.471s sys 0m0.707s # umount /mount1 /mount2 # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 0m11.052s user 0m0.484s sys 0m0.749s # time fgrep abc * real 0m9.035s user 0m0.456s sys 0m0.674s # time fgrep abc * real 0m8.151s user 0m0.458s sys 0m0.632s # umount /mount1 /mount2 # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 0m9.524s user 0m0.465s sys 0m0.687s # time fgrep abc * real 0m5.456s user 0m0.456s sys 0m0.591s # time fgrep abc * real 0m6.063s user 0m0.456s sys 0m0.554s # umount /mount1 /mount2 # mount nfs-server-8.example.net:/data1 /mount1 -o nconnect=16 ; mount nfs-server-8.example.net:/data2 /mount2 -o nconnect=16 # grep xprt /proc/self/mountstats -c 32 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 0m8.166s user 0m0.460s sys 0m0.673s # time fgrep abc * real 0m4.510s user 0m0.486s sys 0m0.535s # time fgrep abc * real 0m4.070s user 0m0.424s sys 0m0.488s ### RHEL7 # uname -r 3.10.0-1127.10.1.el7.x86_64 # mkdir /mount{1,2} # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # df -ht nfs4 Filesystem Size Used Avail Use% Mounted on nfs-server-8.example.net:/data1 1.0T 808G 217G 79% /mount1 nfs-server-8.example.net:/data2 100G 790M 100G 1% /mount2 # cd /mount2 # time fgrep abc * # time fgrep abc * real 0m5.443s user 0m0.478s sys 0m0.533s # umount /mount{1,2} # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 8m29.302s user 0m0.833s sys 0m2.080s # cd # cd # umount /mount{1,2} # mount nfs-server-8.example.net:/data1 /mount1 ; mount nfs-server-8.example.net:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 8m53.381s user 0m0.747s sys 0m2.170s Expanding on what Frank said, if the customer were able to mount share1 to ipv1 and share2 to ipv2 with NFSv4.1 and later, this would avoid the synchronous tasks being queued by the slower tasks assuming the behavior was distinct based on each share. Changing the title in an attempt to better describe what is asked for, though is a bit verbose and not sure I hit the mark. Also I think this may need a more explicit reproducer or it could get lost in the weeds. Here is information on my reproducer, and some results. 
The environment is an nfs server with two IPs, exporting two directories; one directory has a few very large files, and the other has a lot of very small files. The network bandwidth is limited somewhat. write test: configure server with multiple IPs; in this case, I'm using 192.168.122.61 and 192.168.122.161 server# mkdir /data{1,2} server# echo "/data1 *(rw,no_root_squash,sync)" >> /etc/exports server# echo "/data2 *(rw,no_root_squash,sync)" >> /etc/exports server# exportfs -arv client# mkdir /{mount,data}{1,2} client# for i in {1..8} ; do fallocate -l 5G /data1/large_$i ; done client# for i in {1..10000} ; do fallocate -l 4K /data2/small_$i ; done client# mount server:/data1 /mount1 -overs=4.2,sec=sys client# mount server:/data2 /mount2 -overs=4.2,sec=sys *** this portion should be tested and verified... is it necessary? and if so, are these good values *** on client, limit the network bandwidth to both server IPs somewhat: client# IFACE=bond0 client# IP1=192.168.122.61 client# IP2=192.168.122.161 (add a class based queue, and tell the kernel that for calculations, assume that it is a 1 gbit interface) client# tc qdisc add dev $IFACE root handle 1: cbq avpkt 1000 bandwidth 1gbit (add a 5 Mbit class) client# tc class add dev $IFACE parent 1: classid 1:1 cbq rate 5mbit allot 1500 prio 5 bounded isolated (filter which traffic should be shaped) client# tc filter add dev $IFACE parent 1: protocol ip prio 16 u32 match ip dst $IP1 match ip dport 2049 0xffff flowid 1:1 client# tc filter add dev $IFACE parent 1: protocol ip prio 16 u32 match ip dst $IP2 match ip dport 2049 0xffff flowid 1:1 obtain a baseline for copying the 'small_*' files to the nfs mount client# time bash -c 'cp /data2/* /mount2' obtain a baseline for deleting the 'small_*' files from the nfs mount client# time bash -c 'rm -f /mount2/*' on the client, open a second shell and start a loop copying the 'large_*' files to the nfs mount client# while [[ 42 ]] ; do cp -f /data1/* /mount1 ; done back in the client's first terminal, run the actual 'cp' and 'rm' tests client# time bash -c 'cp /data2/* /mount2' client# time bash -c 'rm -f /mount2/*' *** note: these tests will run very slowly, so it might would probably make sense to actually run under 'timeout' and see the progress when that timeout expires. for example: client# time bash -c 'timeout 30m cp /data2/* /mount2' client# time bash -c 'timeout 10m rm -f /mount2/*' here are some results with the test as described in comment 42 # uname -r 3.10.0-1139.el7.x86_64 baselines: # time cp /data2/* /mount2/ real 6m44.090s user 0m0.725s sys 0m20.734s 10,000 files in 6:44 # time bash -c 'rm -f /mount2/*' real 2m8.430s user 0m0.264s sys 0m9.298s 10,000 files in 2:08 actual test (with the 'cp large_*' loop running) (note: I got impatient, and interrupted the tests prematurely, but you can see the progress) # time cp /data2/* /mount2/ ^C real 27m57.309s user 0m0.070s sys 0m0.097s (how much progress was made in those 28 minutes?) time find /mount2 -mindepth 1 -type f | wc -l 16 real 0m33.999s user 0m0.004s sys 0m0.020s so 16 files were created in about 28 minutes (and 'find' on the directory of 16 entries took 34 seconds) # time bash -c 'rm -f /mount2/*' ^C real 8m22.740s user 0m0.008s sys 0m0.034s (how many files were deleted before I got impatient and interrupted?) 
# time find /mount2 -mindepth 1 -type f | wc -l
2

real 0m29.864s
user 0m0.002s
sys 0m0.023s

so 14 files were deleted in 8 1/2 minutes (and 'find' on the directory of 2 entries took about 30 seconds)

RHEL 7 write test with two server IPs:

baseline:
# uname -r
3.10.0-1139.el7.x86_64
# time bash -c 'cp /data2/* /mount2/'
real 6m8.158s
user 0m0.881s
sys 0m25.488s
$ echo 'scale=3; 10000/(6*60+8)' | bc
27.173 (files copied/second)
# time bash -c 'rm -f /mount2/small*'
real 2m15.574s
user 0m0.287s
sys 0m10.097s
$ echo 'scale=3; 10000/(2*60+15)' | bc
74.074 (files removed/second)

# while [[ 42 ]] ; do cp -f /data1/* /mount1/ ; done

actual test:
# time bash -c 'cp /data2/* /mount2'
real 26m13.612s
user 0m0.967s
sys 0m22.356s
# echo "scale=3 ; 10000/1560" | bc
6.410 (files created/second)
# time find /mount2 -type f | wc -l
10000
real 2m37.300s
user 0m0.165s
sys 0m3.247s
# time bash -c 'rm -f /mount2/small*'
real 10m54.551s
user 0m0.327s
sys 0m8.564s
# echo "scale=3 ; 10000/654" | bc
15.290 (files deleted/second)

RHEL 8 write test with two server IPs and nfs v4.2

baseline:
# uname -r
4.18.0-193.14.3.el8_2.x86_64
# time bash -c 'cp /data2/* /mount2/'
real 5m36.192s
user 0m0.757s
sys 0m9.365s
# echo "scale=3; 10000/(5*60+36.192)" | bc
29.744 (files created/second)
# time bash -c 'rm -f /mount2/*'
real 2m3.034s
user 0m0.219s
sys 0m2.957s
# echo "scale=3; 10000/(2*60+3.034)" | bc
81.278 (files removed/second)

actual test:
# time bash -c 'cp /data2/* /mount2/'
^C
real 73m57.468s
user 0m0.174s
sys 0m1.392s
(interrupted) 971 files were created
# echo "scale=3; 971/(73*60+57.468)" | bc
.218
# time bash -c 'rm -f /mount2/*'
real 11m46.588s
user 0m0.054s
sys 0m0.637s
# echo "scale=3; 10000/(11*60+46.588)" | bc
14.152

RHEL 8 write test with two server IPs and nfs v4.2

baseline:
# uname -r
4.18.0-193.14.3.el8_2.x86_64
# time bash -c 'cp /data2/* /mount2/'
Elapsed time: 331.615211196 - 5:31.615
# echo 'scale=3;10000/331.615211196'|bc
30.155
# time bash -c 'rm -f /mount2/*'
Elapsed time: 124.313464477 - 2:04.313
# echo 'scale=3;10000/124.313464477'|bc
80.441

actual test:
$ time bash -c 'cp /data2/* /mount2/'
^C
real 14m15.001s
user 0m0.604s
sys 0m5.583s
interrupted after 5019 files
# echo 'scale=3 ; 5019 / (14*60+15.001)' | bc
5.870 (files created/second)
created 10,000 files on server, re-ran 'rm' test (with large copies running):
# ./timer bash -c 'rm -f /mount2/*'
Elapsed time: 439.417153663 - 7:19.417
user CPU time: 0.377233 - 0.377
sys CPU time: 3.986195 - 3.986
# echo 'scale=3;10000/439.417153663' | bc
22.757 (files deleted/second)

results summary in files/second:

                              cp baseline   cp result   rm baseline   rm result
RHEL 7.9 - nfs v4.0, 2 IPs:   27.13         6.357       74.074        15.290
RHEL 7.9 - nfs v4.2, 1 IP:    24.752        0.010       78.125        0.027
rhel 8.2 - nfs v4.0, 2 IPs:   30.155        5.870       80.441        22.757
rhel 8.2 - nfs v4.2, 1 IP:    29.744        0.218       81.278        14.152

I also tested using nconnect, and did not see any improvement (I am not finding my results)

similar *read* test performed by Jacob Shivers (BZ1703850 comment 30) with single server IP

read test results from RHEL 8.3 nfs client:
# uname -r
4.18.0-221.el8.x86_64

baseline 'read' test of 10000 'small*' files:
# cd /mount2
# time fgrep abc *
real 0m7.768s
user 0m0.471s
sys 0m0.707s

unmount and remount to clear out cached pages:
# umount /mount1 /mount2
# mount server:/data1 /mount1 ; mount server:/data2 /mount2

the loop reading the 'large*' files:
# cd /mount1
# while :; do cat * > /dev/null; done

unmount and remount with 'nconnect':
# umount /mount1 /mount2
# mount server:/data1 /mount1 -o
nconnect=16 ; mount server:/data2 /mount2 -o nconnect=16 # grep xprt /proc/self/mountstats -c 32 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 0m8.166s user 0m0.460s sys 0m0.673s # time fgrep abc * real 0m4.510s user 0m0.486s sys 0m0.535s # time fgrep abc * real 0m4.070s user 0m0.424s sys 0m0.488s (** note: these 'time' commands do not include the time required to build the list of files passed to 'time grep' command, since that's expanded by the shell prior to running 'time' -- would be better to do: time bash -c 'time fgrep abc */') read test results from RHEL 7 nfs client: # uname -r 3.10.0-1127.10.1.el7.x86_64 # cd /mount2 # time fgrep abc * real 0m5.443s user 0m0.478s sys 0m0.533s # umount /mount{1,2} # mount server:/data1 /mount1 ; mount server:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 8m29.302s user 0m0.833s sys 0m2.080s # umount /mount{1,2} # mount server:/data1 /mount1 ; mount server:/data2 /mount2 # cd /mount1 # while :; do cat * > /dev/null; done # cd /mount2 # time fgrep abc * real 8m53.381s user 0m0.747s sys 0m2.170s I landed here via Bruce's associated NFS mailing list thread.... I am just wondering if this is the same bottleneck as one I tried to describe in my rather epic "NFS re-export" thread: https://marc.info/?l=linux-nfs&m=160077787901987&w=4 Long story short, when you have already read lots of data into a client's pagecache (or fscache/cachefiles), you can't reuse it later until you do some metadata lookups first to validate. If you are reading or writing to the server at the time, these metadata lookups can take longer than if you hadn't bothered caching that data locally. My assumption was that for a single server mount, the queue of operations just gets stuck chugging through the longer read/write ones with long waits between metadata lookups. I also found that mounting a different server entirely and testing metadata to that was so much better even if the network link was being saturated by the read or writes to the other server. I figure I'll find the same if I use a multi-homed server too and the same export on different mount paths. I had hoped that nconnect would provide some extra parallel metadata performance for independent client processes but it wasn't to be. Regards, Daire Upstream feedback is that we need more evidence of exactly where the performance problem is: https://lore.kernel.org/linux-nfs/e06c31e4211cefda52091c7710d871f44dc9160e.camel@hammerspace.com/ "AFAICS Tom Talpey's question is the relevant one. Why is there a performance regression being seen by these setups when they share the same connection? Is it really the connection, or is it the fact that they all share the same fixed-slot session?" I don't have a suggestion for how to test that easily. I don't know if it's useful or relevant, but I found that the "metadata starvation" problem when another process is doing lots of reads or writes to the same mountpoint, was easier to demonstrate when the client network was congested. 
I could simulate that artificially with an ingress qdisc on the client: # setup the artificial ingress limit modprobe ifb numifbs=1 ip link set dev ifb0 up tc qdisc add dev eth0 ingress tc qdisc add dev ifb0 root handle 1: htb default 10 r2q 4000 tc class add dev ifb0 parent 1: classid :10 htb rate 200mbit tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0 Then mount your NFS server, do lots of reads with one process and walk the remote filesystem with another. Another way to make it even slower is to have multiple threads of simultaneous readers. But even with this low artificial 200mbit limit, I can mount another server and the walk of that filesystem breezes through nice and fast. I imagine it would be the same if it was the same multi-homed server using a different IP. It just seems like you can't have fast bulk IO and fast metadata response from the same mount point at the same time. Daire (In reply to Daire Byrne from comment #51) > I don't know if it's useful or relevant, but I found that the "metadata > starvation" problem when another process is doing lots of reads or writes to > the same mountpoint, was easier to demonstrate when the client network was > congested. It might be useful to add to the upstream thread, but I think it still doesn't answer the question about why this happens. See also https://lore.kernel.org/linux-nfs/20210616011013.50547-1-olga.kornievskaia@gmail.com/T/#t "This patch series attempts to allow for new mounts that are to the same server (ie nfsv4.1+ session trunkable servers) but different network addresses to use connections associated with those mounts but still use the same client structure." Sounds like the same idea that was already rejected, but it'll be interesting to see where this goes. (In reply to J. Bruce Fields from comment #58) > See also > https://lore.kernel.org/linux-nfs/20210616011013.50547-1-olga. > kornievskaia/T/#t > > "This patch series attempts to allow for new mounts that are to the > same server (ie nfsv4.1+ session trunkable servers) but different > network addresses to use connections associated with those mounts > but still use the same client structure." > > Sounds like the same idea that was already rejected, but it'll be > interesting to see where this goes. Did some testing and the patches work as expected, i.e. allowing for NFSv4.1+ to use distinct TCP streams for a given NFS server IP address if specified at mount with the necessary mount option (max_connect). I will note that an existing patch set is required in order to apply the transport changes as noted below. I have not done any additional testing yet, but I will work on that today/tomorrow. 
# git branch -a | grep nfs remotes/nfs_client/ioctl remotes/nfs_client/ioctl-3.10 remotes/nfs_client/knfsd-devel remotes/nfs_client/linux-next remotes/nfs_client/master remotes/nfs_client/multipath_tcp remotes/nfs_client/testing # grep '"nfs_client"' -A2 .git/config [remote "nfs_client"] url = git://git.linux-nfs.org/projects/trondmy/linux-nfs.git fetch = +refs/heads/*:refs/remotes/nfs_client/* # git checkout nfs_client/linux-next Previous HEAD position was e14c779adebe Merge tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux HEAD is now at 009c9aa5be65 Linux 5.13-rc6 # git checkout -b test_no_collapse Switched to a new branch 'test_no_collapse' # for i in {2..14} ; do wget 'https://lore.kernel.org/linux-nfs/20210608195922.88655-'$i'-olga.kornievskaia@gmail.com/raw' -O sunrpc-$(( i - 1 )).patch ; done # for i in {2..7} ; do wget 'https://lore.kernel.org/linux-nfs/20210616011013.50547-'$i'-olga.kornievskaia@gmail.com/raw' -O nfs-$(( i - 1 )).patch ; done # for i in {1..13} ; do git apply sunrpc-$i.patch ; done # for i in {1..6} ; do git apply nfs-$i.patch ; done # make menuconfig # date; time make -j8; time make -j8 modules; date # date; time make -j8 modules_install; time make -j 8 install; date # grubby --set-default=/boot/vmlinuz-5.13.0-rc6+ # systemctl reboot # uname -r 5.13.0-rc6+ # mkdir /mnt/test{1..3} # getent hosts nfs-server-7.example.net 192.168.124.214 ad-nfs-server.example.net nfs-server-7.example.net 192.168.124.213 ad-nfs-server.example.net nfs-server-7.example.net 192.168.124.130 ad-nfs-server.example.net nfs-server-7.example.net # mount 192.168.124.213:/test1 /mnt/test1 -o vers=4.0 # mount 192.168.124.214:/test2 /mnt/test2 -o vers=4.0 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:955 192.168.124.214:2049 timer:(keepalive,3.950ms,0) ino:32477 sk:2a1d5780 ESTAB 0 0 192.168.124.154:947 192.168.124.213:2049 timer:(keepalive,5.280ms,0) ino:32471 sk:2b013301 # mount 192.168.124.130:/test3 /mnt/test3 -o vers=4.0 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:955 192.168.124.214:2049 timer:(keepalive,6.010ms,0) ino:32477 sk:2a1d5780 ESTAB 0 0 192.168.124.154:947 192.168.124.213:2049 timer:(keepalive,7.030ms,0) ino:32471 sk:2b013301 ESTAB 0 0 192.168.124.154:959 192.168.124.130:2049 timer:(keepalive,8.130ms,0) ino:32478 sk:2027ffc3 # umount /mnt/test* # mount 192.168.124.213:/test1 /mnt/test1 -o vers=4.1 # mount 192.168.124.214:/test2 /mnt/test2 -o vers=4.1 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:788 192.168.124.213:2049 timer:(keepalive,7.270ms,0) ino:32491 sk:77bca981 # umount /mnt/test* # mount 192.168.124.213:/test1 /mnt/test1 -o vers=4.1,max_connect=2 # mount 192.168.124.214:/test2 /mnt/test2 -o vers=4.1,max_connect=2 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:957 192.168.124.213:2049 timer:(keepalive,7.920ms,0) ino:32505 sk:c4be294b ESTAB 0 0 192.168.124.154:721 192.168.124.214:2049 timer:(keepalive,7.920ms,0) ino:32511 sk:792e9056 # mount 192.168.124.130:/test3 /mnt/test3 -o vers=4.1 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:957 192.168.124.213:2049 timer:(keepalive,7.750ms,0) ino:32505 sk:c4be294b ESTAB 0 0 192.168.124.154:721 192.168.124.214:2049 
timer:(keepalive,7.750ms,0) ino:32511 sk:792e9056 # journalctl | tail -2 Jun 19 18:53:48 git-box-8.example.net kernel: SUNRPC: reached max allowed number (1) did not add transport to server: 192.168.124.214 Jun 19 18:54:40 git-box-8.example.net kernel: SUNRPC: reached max allowed number (2) did not add transport to server: 192.168.124.130 # mount 192.168.124.130:/test3 /mnt/test3 -o vers=4.1,max_connect=3 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:957 192.168.124.213:2049 timer:(keepalive,7.690ms,0) ino:32505 sk:c4be294b ESTAB 0 0 192.168.124.154:721 192.168.124.214:2049 timer:(keepalive,7.690ms,0) ino:32511 sk:792e9056 # mount nfs-server-7.example.net:/test3 /mnt/test3 -o vers=4.1 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 192.168.124.154:957 192.168.124.213:2049 timer:(keepalive,4.760ms,0) ino:32505 sk:c4be294b ESTAB 0 0 192.168.124.154:721 192.168.124.214:2049 timer:(keepalive,4.760ms,0) ino:32511 sk:792e905 # mount 192.168.124.214:/test3 /mnt/test3 -o vers=4.1,max_connect=3 # journalctl | tail -2 Jun 19 18:56:00 git-box-8.example.net kernel: SUNRPC: reached max allowed number (2) did not add transport to server: 192.168.124.130 Jun 19 18:56:41 git-box-8.example.net kernel: RPC: addr 192.168.124.214 already in xprt switch I've seen no upstream response to Olga's patches. Another proposal which would also solve this problem, from Neil Brown: https://lore.kernel.org/linux-nfs/162458475606.28671.1835069742861755259@noble.neil.brown.name/ "It is possible to avoid this sharing by creating a separate network namespace for the new connections, but this can often be overly burdensome. This patch introduces the concept of "NFS namespaces" which allows one group of NFS mounts to be completely separate from others without the need for a completely separate network namespace." (In reply to J. Bruce Fields from comment #60) > I've seen no upstream response to Olga's patches. > > Another proposal which would also solve this problem, from Neil Brown: > > https://lore.kernel.org/linux-nfs/162458475606.28671. > 1835069742861755259.brown.name/ > > "It is possible to avoid this sharing by creating a separate network > namespace for the new connections, but this can often be overly > burdensome. This patch introduces the concept of "NFS namespaces" which > allows one group of NFS mounts to be completely separate from others > without the need for a completely separate network namespace." Hello Bruce and Ben, Olga's patches were included in the upstream kernel in v5.14-rc5-36-g7e13420 and can be readily tested on Fedora Rawhide. 
SUNRPC keep track of number of transports to unique addresses SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs NFSv4 introduce max_connect mount options SUNRPC enforce creation of no more than max_connect xprts NFSv4.1 add network transport when session trunking is detected # uname -r 5.15.0-0.rc4.20211008git1da38549dd64.36.fc36.x86_64 # mount 192.168.124.213:/test1 /mnt/test1 -o vers=4.1,max_connect=2 # mount 192.168.124.214:/test2 /mnt/test2 -o vers=4.1,max_connect=2 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess ESTAB 0 0 192.168.124.142:915 192.168.124.213:2049 timer:(keepalive,9.557ms,0) ino:24466 sk:1 cgroup:/ <-> ESTAB 0 0 192.168.124.142:713 192.168.124.214:2049 timer:(keepalive,9.549ms,0) ino:24600 sk:2 cgroup:/ <-> I have asked the customer if they would be willing to run a test kernel to help determine if the patch set is ready for inclusion into RHEL. The patch set provides the feature requirement that is the basis for this RFE/BZ. Is there anything that support delivery or the customer can do to help add these features to RHEL 8? I know that RHEL 8.6 is a release that is a mix of stability and new features. It would seem that this may be the last opportunity to include the feature into RHEL8 unless you think RHEL 8.7 would be an option. More than willing to help and I am sure the customer feels the same in order to get these features into a near-term RHEL release. Thanks Hi Jacob, I don't think Olga's work is going to give the customer what they want, because I think the ability to add a new distinct server endpoint doesn't restrict that specific mount to using only that specific endpoint. Instead, it adds the new endpoint connection to the transport switch. That's my reading of the code -- I haven't tested it. Are you seeing that IO on the mount is restricted to only the specific server endpoint? (In reply to Benjamin Coddington from comment #63) > Hi Jacob, I don't think Olga's work is going to give the customer what they > want, because I think the ability to add a new distinct server endpoint > doesn't restrict that specific mount to using only that specific endpoint. > Instead, it adds the new endpoint connection to the transport switch. > That's my reading of the code -- I haven't tested it. > > Are you seeing that IO on the mount is restricted to only the specific > server endpoint? Your reading was correct and I should have tested again. It basically sends IO in a round-robin fashion. I am going to test disconnecting a server interface to see what recovery looks like, i.e. sending IO to the other IP address. I'll update the BZ once I have completed some additional testing. If an interface is dropped/removed from the NFS server while the NFS client is writing, the NFS client is in a loop of sending duplicate ACKs for the interface that is accessible. While it may be possible to assign the IP address for the removed interface to a remaining interface so long as both interfaces are in the same subnet/vlan, this is not really a solution. This is a nice feature for IO distribution and could see additional benefits when coupled with nconnect, this does not address IO isolation per IP address on NFSv4.1+ I am going to go back and review Neil Brown's patch set to see if there have been any changes. 
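The capture excerpt below illustrates that round-robin behaviour. If needed, the per-address distribution of WRITE calls in such a capture can be summarized with something along these lines (the capture file name is hypothetical; 38 is the NFSv4 WRITE opcode and rpc.msgtyp == 0 selects Calls):

# tshark -n -r max_connect-test.pcap -Y 'nfs.main_opcode == 38 && rpc.msgtyp == 0' -T fields -e ip.dst | sort | uniq -c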
2798 347.605508208 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 8918 Len: 26 2801 347.640395034 192.168.124.213 → 192.168.124.142 NFS 254 V4 Reply (Call In 2798) WRITE 2806 348.641256568 192.168.124.142 → 192.168.124.214 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 8944 Len: 26 2809 348.676883776 192.168.124.214 → 192.168.124.142 NFS 246 V4 Reply (Call In 2806) WRITE 2813 349.678034478 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 8970 Len: 26 2817 349.724954306 192.168.124.213 → 192.168.124.142 NFS 254 V4 Reply (Call In 2813) WRITE 2822 350.725853293 192.168.124.142 → 192.168.124.214 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 8996 Len: 26 2825 350.752136075 192.168.124.214 → 192.168.124.142 NFS 246 V4 Reply (Call In 2822) WRITE 2829 351.752951787 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9022 Len: 26 2833 351.779458628 192.168.124.213 → 192.168.124.142 NFS 254 V4 Reply (Call In 2829) WRITE 2838 352.780406171 192.168.124.142 → 192.168.124.214 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9048 Len: 26 2842 352.860171731 192.168.124.214 → 192.168.124.142 NFS 246 V4 Reply (Call In 2838) WRITE 2846 353.861176709 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9074 Len: 26 2849 353.900166854 192.168.124.213 → 192.168.124.142 NFS 254 V4 Reply (Call In 2846) WRITE 2854 354.901086185 192.168.124.142 → 192.168.124.214 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9100 Len: 26 2858 355.053472681 192.168.124.214 → 192.168.124.142 NFS 246 V4 Reply (Call In 2854) WRITE 2863 356.054473664 192.168.124.142 → 192.168.124.213 NFS 330 V4 Call WRITE StateID: 0x3dc7 Offset: 9126 Len: 26 2886 374.949250397 192.168.124.142 → 192.168.124.214 NFS 254 V4 Call SEQUENCE 2888 374.949971516 192.168.124.214 → 192.168.124.142 NFS 218 V4 Reply (Call In 2886) SEQUENCE 3 2.465860428 192.168.124.142 → 192.168.124.214 TCP 66 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073755625 TSecr=2871260084 4 2.466199346 192.168.124.214 → 192.168.124.142 TCP 66 [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871270324 TSecr=3073520106 27 12.705836369 192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#1] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073765865 TSecr=2871270324 28 12.706153366 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#1] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871280564 TSecr=3073520106 43 22.945838678 192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#2] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073776105 TSecr=2871280564 45 22.946278708 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#2] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871290804 TSecr=3073520106 60 33.185866132 192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#3] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073786345 TSecr=2871290804 61 33.186293691 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#3] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871301044 TSecr=3073520106 76 43.425716907 192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#4] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073796585 TSecr=2871301044 77 43.425984444 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#4] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871311284 TSecr=3073520106 94 53.665836603 
192.168.124.142 → 192.168.124.214 TCP 66 [TCP Dup ACK 3#5] 790 → 2049 [ACK] Seq=1 Ack=1 Win=3113 Len=0 TSval=3073806825 TSecr=2871311284
95 53.666220177 192.168.124.214 → 192.168.124.142 TCP 66 [TCP Dup ACK 4#5] [TCP ACKed unseen segment] 2049 → 790 [ACK] Seq=1 Ack=2 Win=14544 Len=0 TSval=2871321524 TSecr=3073520106

I went back and applied the test patches from Neil Brown; here are the initial results. Prior to this I had been investigating two different kernel panics, but decided that I should remain focused on the results of the tests.

There are two results for the baseline `# cp`: the initial run, and a second run taken after the `# cp` test with the loop running. I believe the second result is more accurate and more in line with expectations, as the baseline `# cp` should be similar with and without the namespace, given that the other mount is not in use.

+-------------------+------------------------------------------------------+--------------------------+--------------------------+--------------------------+
| Mount type        | Baseline `# cp` (initial || after loop test)         | Baseline `# rm`          | `# cp` with loop         | `# rm` with loop         |
+-------------------+------------------------------------------------------+--------------------------+--------------------------+--------------------------+
| Without namespace | 10K 14.4060 files/second || 10K 8.7600 files/second  | 10K 69.4229 files/second | 1130 0.6277 files/second | 1208 2.0102 files/second |
| With namespace    | 10K 06.8001 files/second || 10K 9.7285 files/second  | 10K 83.1179 files/second | 3501 1.9442 files/second | 3709 6.1792 files/second |
+-------------------+------------------------------------------------------+--------------------------+--------------------------+--------------------------+

The namespace feature does improve performance, though ideally more tests should be conducted in a more controlled environment. All that being said, a more intelligent RPC queue may be more fruitful in addressing the limitations of the current FIFO model, though that is certainly non-trivial.

At this time, I don't know exactly where to proceed. There have been no further comments on the patch proposed by Neil Brown, and he admittedly said that the patches were not designed to address the use case they are being tested for, as cited below:

---------------------------------------8<--------------------------------------
https://lore.kernel.org/linux-nfs/162513954601.3001.5763461156445846045@noble.neil.brown.name/

> > I'm just wondering if this could also help with the problem described
> > in this thread:
> >
> > https://marc.info/?t=160199739400001&r=2&w=4

Not really a good fit for that problem.
---------------------------------------8<--------------------------------------

As noted later by Neil, it is easier said than done to track down where the bottleneck for this issue exists. Any suggestions on what further testing could be done would be well received.

The original thread where this issue was brought upstream has remained silent for some time.
https://lore.kernel.org/linux-nfs/20201006151335.GB28306@fieldses.org/ ### Setup ### NFS Server # mkfs.xfs /dev/vdc meta-data=/dev/vdc isize=512 agcount=4, agsize=13107200 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=52428800, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=25600, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # mount /dev/vdc /exports/ # cd /exports/ # mkdir data{1,2} # chmod 777 * # exportfs -rav exporting *:/exports/data2 exporting *:/exports/data1 # ip -4 -o a 1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever 2: ens3 inet 192.168.124.138/24 brd 192.168.124.255 scope global noprefixroute ens3\ valid_lft forever preferred_lft forever 3: ens13 inet 192.168.124.20/24 brd 192.168.124.255 scope global dynamic noprefixroute ens13\ valid_lft 525sec preferred_lft 525sec ### NFS client # mkfs.xfs /dev/vdd meta-data=/dev/vdd isize=512 agcount=4, agsize=13107200 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=52428800, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=25600, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # mount /dev/vdd /data # mkdir /data/data{1,2} # cd /data/ # chmod 777 * # for i in {1..8} ; do fallocate -l 5G /data/data1/large_$i ; done # for i in {1..10000} ; do fallocate -l 4K /data/data2/small_$i ; done # ip -4 -o a 1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever 2: ens3 inet 192.168.124.154/24 brd 192.168.124.255 scope global noprefixroute ens3\ valid_lft forever preferred_lft forever 3: ens14 inet 192.168.124.25/24 brd 192.168.124.255 scope global dynamic noprefixroute ens14\ valid_lft 3046sec preferred_lft 3046sec # IFACE=ens14 # IP1=192.168.124.20 # IP2=192.168.124.138 # tc qdisc add dev $IFACE root handle 1: cbq avpkt 1000 bandwidth 1gbit # tc filter add dev $IFACE parent 1: protocol ip prio 16 u32 match ip dst $IP1 match ip dport 2049 0xffff flowid 1:1 # tc filter add dev $IFACE parent 1: protocol ip prio 16 u32 match ip dst $IP2 match ip dport 2049 0xffff flowid 1:1 ### Without namespace mount option # mount 192.168.124.20:/exports/data1 /mount1 -overs=4.2,sec=sys -vv mount.nfs: timeout set for Tue Nov 23 10:21:55 2021 mount.nfs: trying text-based options 'vers=4.2,sec=sys,addr=192.168.124.20,clientaddr=192.168.124.25' 192.168.124.20:/exports/data1 on /mount1 type nfs (rw,vers=4.2,sec=sys) # mount 192.168.124.20:/exports/data2 /mount2 -overs=4.2,sec=sys -vv mount.nfs: timeout set for Tue Nov 23 10:22:11 2021 mount.nfs: trying text-based options 'vers=4.2,sec=sys,addr=192.168.124.20,clientaddr=192.168.124.25' 192.168.124.20:/exports/data2 on /mount2 type nfs (rw,vers=4.2,sec=sys) # grep nfs4 /proc/self/mounts 192.168.124.20:/exports/data1 /mount1 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0 192.168.124.20:/exports/data2 /mount2 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0 # ss -ptone '( dport = :2049 )' | cat State Recv-Q Send-Q Local Address:Port Peer 
### Without namespace mount option

# mount 192.168.124.20:/exports/data1 /mount1 -overs=4.2,sec=sys -vv
mount.nfs: timeout set for Tue Nov 23 10:21:55 2021
mount.nfs: trying text-based options 'vers=4.2,sec=sys,addr=192.168.124.20,clientaddr=192.168.124.25'
192.168.124.20:/exports/data1 on /mount1 type nfs (rw,vers=4.2,sec=sys)

# mount 192.168.124.20:/exports/data2 /mount2 -overs=4.2,sec=sys -vv
mount.nfs: timeout set for Tue Nov 23 10:22:11 2021
mount.nfs: trying text-based options 'vers=4.2,sec=sys,addr=192.168.124.20,clientaddr=192.168.124.25'
192.168.124.20:/exports/data2 on /mount2 type nfs (rw,vers=4.2,sec=sys)

# grep nfs4 /proc/self/mounts
192.168.124.20:/exports/data1 /mount1 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0
192.168.124.20:/exports/data2 /mount2 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0

# ss -ptone '( dport = :2049 )' | cat
State  Recv-Q  Send-Q  Local Address:Port      Peer Address:Port    Process
ESTAB  0       0       192.168.124.25:884      192.168.124.20:2049  timer:(keepalive,8.530ms,0) ino:32906 sk:9fbeb9ae

* Baseline `# cp`

# time bash -c 'timeout 30m /usr/bin/cp -f /data/data2/* /mount2'

real    11m34.223s
user    0m0.388s
sys     0m4.313s

* Files created per second

$ echo 'scale=4; 10001 / ((11 * 60) + 34.223)' | bc
14.4060

* Baseline `# rm`

# time bash -c 'timeout 10m /usr/bin/rm -f /mount2/*'

real    2m24.059s
user    0m0.147s
sys     0m1.279s

* Files removed per second

$ echo 'scale=4; 10001 / ((2 * 60) + 24.059)' | bc
69.4229

# while [[ 42 ]] ; do /usr/bin/cp -f /data/data1/* /mount1 ; done

# time bash -c 'timeout 30m /usr/bin/cp -f /data/data2/* /mount2'

real    30m0.055s
user    0m0.113s
sys     0m1.839s

# ll /mount2 | wc -l
1130

* Files created per second

$ echo 'scale=4; 1130 / ((30 * 60) + 0.055)' | bc
.6277

*** break loop ***

# time bash -c 'timeout 30m /usr/bin/cp -f /data/data2/* /mount2'

real    19m11.659s
user    0m0.353s
sys     0m4.607s

# time bash -c 'timeout 10m /usr/bin/rm -f /mount2/*'

real    10m0.933s
user    0m0.076s
sys     0m1.489s

# ll /mount2 | wc -l
8793

* Files removed

$ echo $(( 10001 - 8793 ))
1208

* Files removed per second

$ echo 'scale=4; 1208 / ((10 * 60) + 0.933)' | bc
2.0102

### With namespace mount option

# mount 192.168.124.20:/exports/data1 /mount1 -overs=4.2,sec=sys,namespace=data1,clientaddr=192.168.124.25 -vv
mount.nfs: timeout set for Mon Nov 22 14:55:44 2021
mount.nfs: trying text-based options 'vers=4.2,sec=sys,namespace=data1,clientaddr=192.168.124.25,addr=192.168.124.20'
192.168.124.20:/exports/data1 on /mount1 type nfs (rw,vers=4.2,sec=sys,namespace=data1,clientaddr=192.168.124.25)

# mount 192.168.124.138:/exports/data2 /mount2 -overs=4.2,sec=sys,namespace=data2,clientaddr=192.168.124.154 -vv
mount.nfs: timeout set for Mon Nov 22 14:56:22 2021
mount.nfs: trying text-based options 'vers=4.2,sec=sys,namespace=data2,clientaddr=192.168.124.154,addr=192.168.124.138'
192.168.124.138:/exports/data2 on /mount2 type nfs (rw,vers=4.2,sec=sys,namespace=data2,clientaddr=192.168.124.154)

# grep nfs4 /proc/self/mounts
192.168.124.20:/exports/data1 /mount1 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,namespace=data1,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.25,local_lock=none,addr=192.168.124.20 0 0
192.168.124.138:/exports/data2 /mount2 nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,namespace=data2,timeo=600,retrans=2,sec=sys,clientaddr=192.168.124.154,local_lock=none,addr=192.168.124.138 0 0

# ss -ptone '( dport = :2049 )' | cat
State  Recv-Q  Send-Q  Local Address:Port      Peer Address:Port     Process
ESTAB  0       0       192.168.124.25:767      192.168.124.138:2049  timer:(keepalive,3.530ms,0) ino:32330 sk:6f6cdf90
ESTAB  0       0       192.168.124.25:1020     192.168.124.20:2049   timer:(keepalive,7.630ms,0) ino:32327 sk:949111d8
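One additional check worth capturing at this point, suggested here as a sketch rather than something taken from the run above, is to confirm from the server side that the two mounts really did register as two separate NFSv4 client identities and not merely two TCP connections. This assumes the server kernel is new enough to expose the nfsd client tracking directory under /proc/fs/nfsd/clients, which the 4.18-based kernel used here may not be:

# for c in /proc/fs/nfsd/clients/*/info ; do echo "== $c" ; cat "$c" ; done

Two entries with distinct clientid values, one per namespace, would confirm that the namespace option is producing separate client identities rather than trunking both mounts over a single clientid.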
# time bash -c 'timeout 30m cp /data/data2/* /mount2'

real    24m30.697s
user    0m0.328s
sys     0m3.998s

* Files created per second

$ echo 'scale=4; 10001 / ((24 * 60) + 30.697)' | bc
6.8001

# time bash -c 'timeout 10m rm -f /mount2/*'

real    2m0.323s
user    0m0.110s
sys     0m1.490s

* Files removed per second

$ echo 'scale=4; 10001 / ((2 * 60) + 0.323)' | bc
83.1179

# while [[ 42 ]] ; do /usr/bin/cp -f /data/data1/* /mount1 ; done

# time bash -c 'timeout 30m cp /data/data2/* /mount2'

real    30m0.656s
user    0m0.289s
sys     0m2.178s

# ll /exports/data2/* | wc -l
3501

* Files created per second

$ echo 'scale=4; 3501 / ((30 * 60) + 0.656)' | bc
1.9442

*** stop the loop ***

# time bash -c 'timeout 30m /usr/bin/cp -f /data/data2/* /mount2'

real    17m18.001s
user    0m0.376s
sys     0m3.526s

* Files created per second

$ echo 'scale=4; 10001 / ((17 * 60) + 18.001)' | bc
9.6348

# time bash -c 'timeout 10m rm -f /mount2/*'

real    10m0.231s
user    0m0.116s
sys     0m0.967s

# ll /exports/data2/* | wc -l
6292

* 3709 files deleted in 10 minutes

$ echo $(( 10001 - 6292 ))
3709

* Files deleted per second

$ echo 'scale=4; 3709 / ((10 * 60) + 0.231)' | bc
6.1792

I just thought I'd add that when I also tested Neil Brown's patches, I only saw minimal improvement in my testing.

My test was similar except I was simply reading files to /dev/null in one namespace (maxing the network) and doing an "ls -lR" in another namespace mount.

In your use case, you would be happy with a separate mount that could be used for fast metadata intensive workloads (scanning a filesystem), but in my case I would prefer that there was a means to have a single mount that could "prefer" or prioritise the smaller metadata lookups. I'm not sure how "ls -l" and readdirplus lookups would fit into such a priority scheme.

Ultimately, we want to be able to cache huge datasets on a client (either in pagecache or fscache), but before we can use that cached data (accumulated over days), we need to validate the cache and see if the file changed. And it is the slowness of these lookups on a busy read/write client that hurts our caching performance.

I would love to help identify the bottleneck or test any solutions, but I really don't know where to start.

Daire

(In reply to Daire Byrne from comment #67)
> I just thought I'd add that when I also tested Neil Brown's patches, I only
> saw minimal improvement in my testing.
>
> My test was similar except I was simply reading files to /dev/null in one
> namespace (maxing the network) and doing an "ls -lR" in another namespace
> mount.
>
> In your use case, you would be happy with a separate mount that could be
> used for fast metadata intensive workloads (scanning a filesystem), but in
> my case I would prefer that there was a means to have a single mount that
> could "prefer" or prioritise the smaller metadata lookups. I'm not sure how
> "ls -l" and readdirplus lookups would fit into such a priority scheme.
>
> Ultimately, we want to be able to cache huge datasets on a client (either in
> pagecache or fscache), but before we can use that cached data (accumulated
> over days), we need to validate the cache and see if the file changed. And
> it is the slowness of these lookups on a busy read/write client that hurts
> our caching performance.
>
> I would love to help identify the bottleneck or test any solutions, but I
> really don't know where to start.
>
> Daire

Hello Daire,

Thank you for reaching out. I had previously considered emailing you to see what kind of results you had observed in your own testing :)

My primary daily focus has shifted, so I generally spend less time on NFS issues. That being said, even if I had more time to investigate, moving this issue forward still seems largely non-trivial. I intend to go back and review the responses from the email thread that started this conversation.

Could you say whether you have done any further testing or have any additional observations at this time?

Thanks,

Jacob
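As a possible starting point for further testing, and for quantifying the metadata slowdown described above, one rough approach would be to compare the per-operation latencies the client already tracks for the metadata-heavy mount while the bulk I/O is idle versus while it is running. This is only a sketch using the mountstats tool from nfs-utils, not something from the runs above, and the grep pattern assumes the operation names start a line in the default output:

# mountstats /mount1 | grep -E -A3 '^(GETATTR|LOOKUP|ACCESS|READDIRPLUS):'
  ... start the large-file loop on the other mount, wait a few minutes, then repeat ...
# mountstats /mount1 | grep -E -A3 '^(GETATTR|LOOKUP|ACCESS|READDIRPLUS):'

If the average execute time for these operations grows much more than the RTT while the bulk transfer is running, the extra time is being spent queued on the client side rather than at the server, which is where a smarter RPC queue or the namespace separation would be expected to help. Because the numbers are cumulative since mount time, remounting before each sample, or using interval output from nfsiostat, gives a cleaner comparison.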
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.