Created attachment 1996856 [details]
Scaling script

Description of problem:
I scaled the gateway by subsystem, where each subsystem contained 32 namespaces, and at the 3017th bdev definition the gateway stopped. A restart of the gateway shows an omap_version key error, and subsequent commands to the gateway, such as get_subsystems, return nothing. The omap object is still there, and a listomapkeys shows that the keys are still intact.

Version-Release number of selected component (if applicable):
0.0.4-1

How reproducible:


Steps to Reproduce:
1. Deploy a Gateway
2. Configure the scale-gateway script to create the desired configuration
3. Run the script

Actual results:
The gateway stops and no longer returns config state.

Expected results:
If a scale limit is reached, I'd expect a soft fail rather than this hard fail where the config is no longer accessible.

Additional info:
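For reference, this is roughly how the omap state object can be inspected from a node with cluster access (a sketch - the pool and object name below are assumptions based on gateway defaults, so substitute the values from the gateway configuration):

# rados -p rbd ls | grep nvmeof
# rados -p rbd listomapkeys nvmeof.state | wc -l
# rados -p rbd listomapkeys nvmeof.state | head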
Created attachment 1996858 [details]
journalctl output from the gateway
Gil suggested trying a build with PR 272. Switched to the 0.0.5 tag and retesting.
Same issue with 0.0.5, at around the same point - 95 subsystems, 17/32 bdevs.

The error message is:

2023-11-03T03:34:03.733+0000 7fae1921c780 -1 Errors while parsing config file!
2023-11-03T03:34:03.733+0000 7fae1921c780 -1 can't open ceph.conf: (24) Too many open files
2023-11-03T03:34:03.733+0000 7fae1921c780 -1 ERROR: failed to call res_ninit()
2023-11-03T03:34:03.733+0000 7fae1921c780 -1 monclient: get_monmap_and_config cannot identify monitors to contact

Running ulimit -a in the pod shows:

sh-5.1# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 2059423
max locked memory           (kbytes, -l) unlimited
max memory size             (kbytes, -m) unlimited
open files                          (-n) 10240   <-------
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1048576
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

# ps -ef | grep nvmf
root     1005431 1005428 99 02:27 ?        09:27:38 /usr/local/bin/nvmf_tgt -u -r /var/tmp/spdk.sock --cpumask=0xFF --msg-mempool-size=524288
root     1920008  969496  0 03:38 pts/1    00:00:00 grep --color=auto nvmf

Looking at the count of active file descriptors for the pid:

# ls -l /proc/1005431/fd | wc -l
10240

In this case, 6500 of the fds were sockets (output attached).
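For anyone reproducing this, the fd counts can be gathered with something like the following (a sketch - the pgrep pattern assumes a single nvmf_tgt process in the pod):

# pid=$(pgrep -f nvmf_tgt)
# ls /proc/$pid/fd | wc -l                 # total open file descriptors
# ls -l /proc/$pid/fd | grep -c 'socket:'  # how many of those are sockets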
Created attachment 1996868 [details]
ls output from /proc/<pid>/fd for the nvmf_tgt process
With 0.0.5, after the error is seen, the get_subsystems command still works, which is good!

However, when I restart the gateway I see the same issue as observed in 0.0.4-1:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/remote-source/ceph-nvmeof/app/control/state.py", line 420, in _update_caller
    self.update()
  File "/remote-source/ceph-nvmeof/app/control/state.py", line 434, in update
    omap_version = int(omap_state_dict[self.omap.OMAP_VERSION_KEY])
KeyError: 'omap_version'

And now a get_subsystems just returns the discovery subsystem:

# nvmeof-cli get_subsystems
INFO:__main__:Get subsystems:
[
    {
        "nqn": "nqn.2014-08.org.nvmexpress.discovery",
        "subtype": "Discovery",
        "listen_addresses": [],
        "allow_any_host": true,
        "hosts": []
    }
]
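To confirm whether the key the gateway is complaining about is actually present, something like this can be used (again a sketch - pool and object name are assumptions):

# rados -p rbd getomapval nvmeof.state omap_version
# rados -p rbd listomapkeys nvmeof.state | wc -l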
Is it possible the high number of sockets is related to the bdevs_per_cluster parameter? I'm using the default, but at high numbers of bdevs this is going to mean a high number of librbd threads, isn't it?
(In reply to Paul Cuzner from comment #6)
> Is it possible the high number of sockets is related to the
> bdevs_per_cluster parameter? I'm using the default, but at high numbers of
> bdevs this is going to mean a high number of librbd threads, isn't it?

I think so. At this scale, the new default of 8 is not that different from the previous default of 1.
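For reference, bdevs_per_cluster is a gateway configuration option rather than a CLI flag; a minimal sketch of overriding it, assuming it lives in the [spdk] section of ceph-nvmeof.conf (check the config shipped with your build, as the section and default may differ between versions):

[spdk]
# a larger value shares one librados/librbd client instance across more bdevs,
# so fewer clients (and fewer threads/sockets) are created overall
bdevs_per_cluster = 32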
I scaled back to 2048 bdevs/namespaces, and the thread count for the gateway process is 3,850. At this level, the ulimit is not hit.

Scaling threads based on bdevs seems like a scaling issue - especially since some bdevs will not be active, and ultimately I/O is limited by the free cycles in the reactor threads, not the bdevs themselves. With the current implementation these threads will likely remain unused.

Looking at the distribution of the threads by name shows (ignoring reactor threads, etc.):

1024 safe_timer
 512 io_context_pool
 256 msg_worker_0
 256 msg_worker_1
 256 msg_worker_2
 256 log
 256 service
 256 ceph_timer
 256 ms_dispatch
 256 ms_local
 256 taskfin_librbd

Would it make sense to correlate the number of librbd threads with the number of reactor cores as a default?
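For reference, the per-name breakdown above can be reproduced with something like this (assuming $pid is the nvmf_tgt pid from the earlier ps output):

# cat /proc/$pid/task/*/comm | sort | uniq -c | sort -rn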
(In reply to Paul Cuzner from comment #8)
> Looking at the distribution of the threads by name shows (ignoring reactor
> threads, etc.):
>
> 1024 safe_timer
>  512 io_context_pool
>  256 msg_worker_0
>  256 msg_worker_1
>  256 msg_worker_2
>  256 log
>  256 service
>  256 ceph_timer
>  256 ms_dispatch
>  256 ms_local
>  256 taskfin_librbd

Hrm, someone should probably look at whether having 4 safe_timer threads per librados/librbd client instance is justified.

> Would it make sense to correlate the number of librbd threads with the
> number of reactor cores as a default?

Only the number of io_context_pool and msg_worker threads per librados/librbd client instance can be configured (the defaults are 2 and 3, respectively). Instead of messing with individual threads, I would suggest correlating the number of librados/librbd client instances. A single client should be able to handle a lot more than 8 bdevs.
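For context, a hedged sketch of where those two per-client knobs live - assuming they correspond to the librados_thread_count (io_context_pool) and ms_async_op_threads (msg_worker) Ceph options, which is my reading rather than something stated above:

# defaults are 2 and 3 respectively; changes take effect for newly created client instances
ceph config set client librados_thread_count 2
ceph config set client ms_async_op_threads 3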
Agree - a single client could.

I think what you're saying is to increase bdevs_per_cluster from 8 to reduce the number of client threads created. This would work to limit the threads created, but it potentially limits the performance potential of the gateway from a client perspective, right? For example, with 8 datastores, to get the performance to an acceptable level I had to drop bdevs_per_cluster to 4 and add dummy bdevs (72!) to increase the rbd client count.

The balancing act of reactor coremask and bdevs_per_cluster just doesn't feel like a user-friendly way to scale the gateway :(

Here's a table showing the threads created for a given number of namespaces, based on different bdevs_per_cluster values:

namespaces   bdevs_per_cluster=8         bdevs_per_cluster=16        bdevs_per_cluster=32
512          threads=960/64 clients      threads=608/32 clients      threads=432/16 clients
1024         threads=1920/128 clients    threads=1216/64 clients     threads=864/32 clients
2048         threads=3840/256 clients    threads=2432/128 clients    threads=1728/64 clients
3072         threads=5760/384 clients    threads=3648/192 clients    threads=2592/96 clients
4096         threads=7680/512 clients    threads=4864/256 clients    threads=3456/128 clients

Given the above, perhaps we should be increasing the ulimit (max open files) on the gateway pod(s) anyway - if we want to support up to 4096 namespaces?

What bugs me is that smaller configurations (fewer, larger datastores) will get fewer rbd clients, which will limit performance and ultimately not see the reactors fully utilised. That's why I was suggesting using the reactor cores as the multiplier for the number of librbd clients, instead of the number of namespaces. Is that even feasible?

The only way the current design seems to work is if there are heaps of active datastores - but the issue is that 1 datastore supports many VMs (20-50 rule of thumb), so it's questionable whether this will ever happen.
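As an aside, the numbers in the table follow a simple pattern: roughly namespaces/2 threads plus about 11 threads per librados/librbd client (where clients = namespaces / bdevs_per_cluster). That is an inference from the figures above rather than a statement about SPDK or librbd internals, but it reproduces every table entry and lines up with the comment #8 breakdown (safe_timer scaling with the namespace count, the remaining thread types adding up to roughly 11 per client):

for ns in 512 1024 2048 3072 4096; do
  for bpc in 8 16 32; do
    clients=$((ns / bpc))
    threads=$((ns / 2 + 11 * clients))
    echo "namespaces=$ns bdevs_per_cluster=$bpc clients=$clients threads=$threads"
  done
done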
(In reply to Paul Cuzner from comment #10)
> Agree - a single client could.
>
> I think what you're saying is to increase bdevs_per_cluster from 8 to
> reduce the number of client threads created. This would work to limit the
> threads created, but it potentially limits the performance potential of the
> gateway from a client perspective, right? For example, with 8 datastores,
> to get the performance to an acceptable level I had to drop
> bdevs_per_cluster to 4 and add dummy bdevs (72!) to increase the rbd client
> count.

This goes to highlight that bdevs aren't equal. A bdev which represents a datastore that is home to dozens of VMs is very different from a "regular" bdev, which the librados/librbd client instance sharing feature is intended for. Such a datastore bdev may even need a dedicated client instance, assuming no bottleneck somewhere else in SPDK.

> The balancing act of reactor coremask and bdevs_per_cluster just doesn't
> feel like a user-friendly way to scale the gateway :(
>
> Here's a table showing the threads created for a given number of
> namespaces, based on different bdevs_per_cluster values:
>
> namespaces   bdevs_per_cluster=8         bdevs_per_cluster=16        bdevs_per_cluster=32
> 512          threads=960/64 clients      threads=608/32 clients      threads=432/16 clients
> 1024         threads=1920/128 clients    threads=1216/64 clients     threads=864/32 clients
> 2048         threads=3840/256 clients    threads=2432/128 clients    threads=1728/64 clients
> 3072         threads=5760/384 clients    threads=3648/192 clients    threads=2592/96 clients
> 4096         threads=7680/512 clients    threads=4864/256 clients    threads=3456/128 clients
>
> Given the above, perhaps we should be increasing the ulimit (max open files)
> on the gateway pod(s) anyway - if we want to support up to 4096 namespaces?

Increasing ulimits is definitely the way to go if separate librados/librbd client instances (and the additional sockets/threads/etc. that come with them) are actually needed. It's the "actually needed" part that I wasn't clear about, because I was missing the "massive datastore" use case.

> What bugs me is that smaller configurations (fewer, larger datastores) will
> get fewer rbd clients, which will limit performance and ultimately not see
> the reactors fully utilised. That's why I was suggesting using the reactor
> cores as the multiplier for the number of librbd clients, instead of the
> number of namespaces. Is that even feasible?

I think it's a good idea. It would require code changes in SPDK though.
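Since increasing ulimits is the direction here, a hedged sketch of how that could be done, assuming the gateway container is launched by a cephadm-generated systemd unit (the unit name below is a placeholder):

# raise the open-files limit on the unit that launches the container
mkdir -p /etc/systemd/system/ceph-<fsid>@nvmeof.<daemon-id>.service.d
printf '[Service]\nLimitNOFILE=65536\n' > /etc/systemd/system/ceph-<fsid>@nvmeof.<daemon-id>.service.d/override.conf
systemctl daemon-reload
systemctl restart ceph-<fsid>@nvmeof.<daemon-id>.service

# or, if the container is started by hand, pass the limit to podman directly
podman run --ulimit nofile=65536:65536 ...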
Paul - do you think this BZ is still relevant? I suggest we close it and test what we need for 7.1, which is up to 400 namespaces.
I think what you're saying is that since 400 namespaces is the support limit in 7.1, this issue is a low priority - Agree! For now, I'm happy for you to close this BZ - but we need to plan to push the limits higher to be competitive and more useful to customers. IMO, we shouldn't lose sight of this kind of scale testing going forward.
Per Paul's comment at https://ibm-systems-storage.slack.com/archives/C05AM6G7ZF1/p1716200930999259 these can be closed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925