Bug 2247718 - Scaling the gateway to 3017 bdevs resulted in a gw exception, and apparent config loss
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NVMeOF
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Aviv Caro
QA Contact: Paul Cuzner
Docs Contact: ceph-doc-bot
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-11-02 23:25 UTC by Paul Cuzner
Modified: 2024-06-13 14:22 UTC
CC: 4 users

Fixed In Version: ceph-18.2.1-175.el9cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-13 14:22:55 UTC
Embargoed:


Attachments
Scaling script (3.29 KB, application/x-shellscript) - 2023-11-02 23:25 UTC, Paul Cuzner
journalctl output from the gateway (673.28 KB, application/gzip) - 2023-11-02 23:38 UTC, Paul Cuzner
ls output from /proc/<pid>/fd for the nvmf_tgt process (674.69 KB, text/plain) - 2023-11-03 03:56 UTC, Paul Cuzner


Links
Github ceph/ceph-nvmeof issue 309 (open): Gateway stopped at 3017 bdevs, gw restarts but returns an empty config (last updated 2023-11-02 23:25:43 UTC)
Red Hat Issue Tracker RHCEPH-7847 (last updated 2023-11-02 23:25:59 UTC)
Red Hat Product Errata RHSA-2024:3925 (last updated 2024-06-13 14:22:58 UTC)

Description Paul Cuzner 2023-11-02 23:25:44 UTC
Created attachment 1996856 [details]
Scaling script

Description of problem:
I scaled the gateway by subsystem, where each subsystem contained 32 namespaces, and at the 3017th bdev definition the gateway stopped. A restart of the gateway shows an omap_version key error, and subsequent commands to the gateway, such as get_subsystems, return nothing.

The omap object is still there, and a listomapkeys shows that the keys are still intact.
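
For reference, the check described above can be run with standard rados commands; the pool name "rbd" and object name "nvmeof.state" below are assumptions taken from a default ceph-nvmeof.conf and may differ in this deployment:

# list the OMAP keys the gateway persisted (subsystems, bdevs, namespaces, the omap_version counter, ...)
rados -p rbd listomapkeys nvmeof.state | wc -l
rados -p rbd listomapkeys nvmeof.state | head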

Version-Release number of selected component (if applicable):
0.0.4-1

How reproducible:


Steps to Reproduce:
1. Deploy a gateway
2. Configure the scale-gateway script to create the desired configuration
3. Run the script (a hedged sketch of the loop such a script performs is shown below)
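
For illustration, a hedged sketch of the kind of loop such a scaling script performs; the nvmeof-cli subcommand and flag names here are assumptions based on the 0.0.x CLI, and the attached scaling script is the authoritative version:

# hypothetical sketch only - see the attached script for the exact commands used
for s in $(seq 1 96); do
    nqn="nqn.2016-06.io.spdk:cnode${s}"
    nvmeof-cli create_subsystem --subnqn "${nqn}"
    for n in $(seq 1 32); do
        nvmeof-cli create_bdev --pool rbd --image "img-${s}-${n}" --bdev "bdev-${s}-${n}"
        nvmeof-cli add_namespace --subnqn "${nqn}" --bdev "bdev-${s}-${n}"
    done
done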

Actual results:
Gateway stops, and no longer returns config state

Expected results:
If a scale limit is reached, I'd expect a soft fail rather than this hard fail where the config is no longer accessible


Additional info:

Comment 1 Paul Cuzner 2023-11-02 23:38:45 UTC
Created attachment 1996858 [details]
journalctl output from the gateway

Comment 2 Paul Cuzner 2023-11-03 02:35:00 UTC
Gil suggested trying a build with PR 272. Switched to 0.0.5 tag and retesting.

Comment 3 Paul Cuzner 2023-11-03 03:53:19 UTC
Same issue with 0.0.5, at around the same point - 95 subsystems, 17/32 bdevs.

The error message is:
2023-11-03T03:34:03.733+0000 7fae1921c780 -1 Errors while parsing config file!
2023-11-03T03:34:03.733+0000 7fae1921c780 -1 can't open ceph.conf: (24) Too many open files
2023-11-03T03:34:03.733+0000 7fae1921c780 -1 ERROR: failed to call res_ninit()
2023-11-03T03:34:03.733+0000 7fae1921c780 -1 monclient: get_monmap_and_config cannot identify monitors to contact

Running ulimit -a in the pod shows:
sh-5.1# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 2059423
max locked memory           (kbytes, -l) unlimited
max memory size             (kbytes, -m) unlimited
open files                          (-n) 10240                         <-------
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1048576
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

# ps -ef | grep nvmf
root     1005431 1005428 99 02:27 ?        09:27:38 /usr/local/bin/nvmf_tgt -u -r /var/tmp/spdk.sock --cpumask=0xFF --msg-mempool-size=524288
root     1920008  969496  0 03:38 pts/1    00:00:00 grep --color=auto nvmf

Looking at the count of active file descriptors for the pid:
# ls -l /proc/1005431/fd| wc -l
10240

In this case, 6500 fd's were sockets (output attached)
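
For anyone reproducing this, the same /proc listing gives the per-type breakdown, and /proc/<pid>/limits shows the limit that actually applies to the running process rather than to the shell used for ulimit -a (pid as above):

# total fds vs. fds that are sockets
ls -l /proc/1005431/fd | wc -l
ls -l /proc/1005431/fd | grep -c 'socket:'
# the "Max open files" line that applies to nvmf_tgt itself
grep 'open files' /proc/1005431/limits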

Comment 4 Paul Cuzner 2023-11-03 03:56:50 UTC
Created attachment 1996868 [details]
ls output from /proc/<pid>/fd for the nvmf_tgt process

Comment 5 Paul Cuzner 2023-11-03 04:02:23 UTC
With 0.0.5, after the error is seen, the get_subsystems command still works which is good!

However, when I restart the gateway I see the same issue as observed in 0.0.4-1:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/remote-source/ceph-nvmeof/app/control/state.py", line 420, in _update_caller
    self.update()
  File "/remote-source/ceph-nvmeof/app/control/state.py", line 434, in update
    omap_version = int(omap_state_dict[self.omap.OMAP_VERSION_KEY])
KeyError: 'omap_version'

And now, a get_subsystems just returns the discovery subsystem.
# nvmeof-cli get_subsystems
INFO:__main__:Get subsystems:
[
    {
        "nqn": "nqn.2014-08.org.nvmexpress.discovery",
        "subtype": "Discovery",
        "listen_addresses": [],
        "allow_any_host": true,
        "hosts": []
    }
]
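
A quick way to confirm whether the omap_version key named in the traceback is actually present on the Ceph side (again assuming the default pool "rbd" and object name "nvmeof.state"):

rados -p rbd getomapval nvmeof.state omap_version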

Comment 6 Paul Cuzner 2023-11-03 06:39:00 UTC
Is it possible the high number of sockets is related to the bdevs_per_cluster parameter? I'm using the default, but at high numbers of bdevs this is going to mean a high number of librbd threads, isn't it?

Comment 7 Ilya Dryomov 2023-11-03 09:39:59 UTC
(In reply to Paul Cuzner from comment #6)
> Is it possible the high number of sockets is related to the
> bdevs_per_cluster parameter. I'm using the default, but at high numbers of
> bdevs this is going to mean a high number of librbd threads isn't it?

I think so.  At this scale, the new default of 8 is not that different from the previous default of 1.
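
For reference, a sketch of where these knobs live in the gateway configuration; the section and option placement is my reading of ceph-nvmeof.conf and should be treated as an assumption rather than a verified layout (the nvmf_tgt arguments are taken from the ps output in comment 3):

# ceph-nvmeof.conf (excerpt, illustrative only)
[spdk]
tgt_cmd_extra_args = --cpumask=0xFF --msg-mempool-size=524288
bdevs_per_cluster = 8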

Comment 8 Paul Cuzner 2023-11-05 21:20:21 UTC
I scaled back to 2048 bdevs/namespaces, and the thread count for the gateway process is 3,850. At this level, the ulimit is not hit.

Scaling threads based on bdevs seems like a scaling issue, especially since some bdevs will not be active, and ultimately I/O is limited by the free cycles in the reactor threads, not by the bdevs themselves.

With the current implementation these threads will likely remain unused.

Looking at the distribution of the threads by name shows (ignoring reactor threads, etc.):

1024 safe_timer
 512 io_context_pool
 256 msg_worker_0
 256 msg_worker_1
 256 msg_worker_2
 256 log
 256 service
 256 ceph_timer
 256 ms_dispatch
 256 ms_local
 256 taskfin_librbd

Would it make sense to correlate the number of librbd threads with the number of reactor cores as a default?

Comment 9 Ilya Dryomov 2023-11-06 11:01:03 UTC
(In reply to Paul Cuzner from comment #8)
> Looking at the distribution of the threads by name shows (ignores rector
> threads etc);
> 
> 1024 safe_timer
>  512 io_context_pool
>  256 msg_worker_0
>  256 msg_worker_1
>  256 msg_worker_2
>  256 log
>  256 service
>  256 ceph_timer
>  256 ms_dispatch
>  256 ms_local
>  256 taskfin_librbd

Hrm, someone should probably look at whether having 4 safe_timer threads per librados/librbd client instance is justified.

> 
> Would is make sense to correlate the number of librbd threads with the
> number of reactor cores as a default?

Only the number of io_context_pool and msg_worker threads per librados/librbd client instance can be configured (the default is 2 and 3 respectively).

Instead of messing with individual threads, I would suggest correlating the number of librados/librbd client instances with the reactor cores.  A single client should be able to handle a lot more than 8 bdevs.

Comment 10 Paul Cuzner 2023-11-09 04:31:27 UTC
Agree - a single client could. 

I think what you're saying is to increase bdevs_per_cluster from 8, to reduce the number of client threads created. This would work to limit the threads created, but it potentially limits the performance potential of the gateway from a client perspective, right? For example, with 8 datastores, to get the performance to an acceptable level I had to drop bdevs_per_cluster to 4 and add dummy bdevs (72!) to increase the rbd client count.

The balancing act of reactor coremask and bdevs_per_cluster just doesn't feel like a user-friendly way to scale the gateway :(

Here's a table showing the threads created for a given number of namespaces, based on different bdevs_per_cluster values (a small cross-check of one row follows the table):

namespaces   bdevs_per_cluster=8          bdevs_per_cluster=16          bdevs_per_cluster=32
512          threads=960/64 clients       threads=608/32 clients        threads=432/16 clients
1024         threads=1920/128 clients     threads=1216/64 clients       threads=864/32 clients
2048         threads=3840/256 clients     threads=2432/128 clients      threads=1728/64 clients
3072         threads=5760/384 clients     threads=3648/192 clients      threads=2592/96 clients
4096         threads=7680/512 clients     threads=4864/256 clients      threads=3456/128 clients
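
As a cross-check, the bdevs_per_cluster=8 column follows from the per-client thread breakdown in comment 8 (4 safe_timer + 2 io_context_pool + 3 msg_worker + 6 others = 15 threads per librados/librbd client); a small sketch of one row:

# reproduce the 2048-namespace / bdevs_per_cluster=8 row of the table above
namespaces=2048; bdevs_per_cluster=8; threads_per_client=15
clients=$(( namespaces / bdevs_per_cluster ))
echo "clients=${clients} threads=$(( clients * threads_per_client ))"    # prints clients=256 threads=3840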


Given the above, perhaps we should be increasing the ulimit (max open files) on the gateway pod(s) anyway, if we want to support up to 4096 namespaces?
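
One hedged way to do that for a cephadm-managed gateway would be extra_container_args in the service spec, which cephadm passes through to the container runtime; the service_type, spec layout, and nofile value below are assumptions for illustration, not a tested configuration:

# nvmeof-gw.yaml (sketch only), applied with: ceph orch apply -i nvmeof-gw.yaml
service_type: nvmeof
service_id: rbd
placement:
  hosts:
    - gw-node-1
extra_container_args:
  - "--ulimit=nofile=65536:65536"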

What bugs me is smaller configurations (fewer, larger datastores) will get fewer rbd clients, which will limit performance and ultimately not see the reactors fully utilised. That's why I was suggesting using the reactor cores as the multiplier for the number of librbd clients, instead of the number of namespaces. Is that even feasible?

The only way the current design seems to work is if there are heaps of active datastores - but the issue is that 1 datastore supports many VMs (20-50 rule of thumb), so it's questionable whether this will ever happen.

Comment 11 Ilya Dryomov 2023-11-09 15:32:24 UTC
(In reply to Paul Cuzner from comment #10)
> Agree - a single client could. 
> 
> I think what you're saying is increase the bdevs_per_cluster from 8, to
> reduce the number of client threads created. This would work to limit the
> threads created, but potentially limits the performance potential of the
> gateway from a client perspective, right? For example, with 8 datastores to
> get the performance to an acceptable level, I had to drop the
> bdevs_per_cluster to 4 and add dummy bdevs(72!) to increase the rbd client
> count.

This goes to highlight that bdevs aren't equal.  A bdev which represents a datastore which is home to dozens of VMs is very different from a "regular" bdev which the librados/librbd client instance sharing feature is intended for.  Such a datastore bdev may even need a dedicated client instance, assuming no bottleneck somewhere else in SPDK.

>
> The balancing act of reactor coremask and bdevs_per_cluster just doesn't
> feel like a user friendly way to scale the gateway :(
> 
> Here's a table showing the threads created for a given number of namespaces,
> based on different bdevs_per_cluster values
>
> namespaces   bdevs_per_cluster=8         bdevs_per_cluster=16         bdevs_per_cluster=32
> 512          threads=960/64 clients      threads=608/32 clients       threads=432/16 clients
> 1024         threads=1920/128 clients    threads=1216/64 clients      threads=864/32 clients
> 2048         threads=3840/256 clients    threads=2432/128 clients     threads=1728/64 clients
> 3072         threads=5760/384 clients    threads=3648/192 clients     threads=2592/96 clients
> 4096         threads=7680/512 clients    threads=4864/256 clients     threads=3456/128 clients
> 
> Given the above, perhaps we should be increasing the ulimit (max open files)
> on the gateway pod(s) anyway- if we want to support up to 4096 namespaces?

Increasing ulimits is definitely the way to go if separate librados/librbd client instances (and additional sockets/threads/etc that come with them) are actually needed.  It's the "actually needed" part that I wasn't clear about because I was missing the "massive datastore" use case.

> 
> What bugs me is smaller configurations (fewer, larger datastores) will get
> fewer rbd clients, which will limit performance and ultimately not see the
> reactors fully utilised. That's why I was suggesting using the reactor cores
> as the multiplier for the number of librbd clients, instead of the number of
> namespaces. Is that even feasible?

I think it's a good idea.  It would require code changes in SPDK though.

Comment 12 Aviv Caro 2024-04-07 10:05:20 UTC
Paul, do you think this BZ is still relevant? I suggest closing it and testing what we need for 7.1, which is up to 400 namespaces.

Comment 13 Paul Cuzner 2024-04-07 21:25:38 UTC
I think what you're saying is that since 400 namespaces is the support limit in 7.1, this issue is a low priority. Agreed!

For now, I'm happy for you to close this BZ - but we need to plan to push the limits higher to be competitive and more useful to customers. IMO, we shouldn't lose sight of this kind of scale testing going forward.

Comment 16 Rahul Lepakshi 2024-05-22 05:02:20 UTC
Per Paul's comment at https://ibm-systems-storage.slack.com/archives/C05AM6G7ZF1/p1716200930999259 these can be closed.

Comment 17 errata-xmlrpc 2024-06-13 14:22:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

