From the log, I see that there are multiple Ganesha instances. I'm not quite sure why the initial instance has 47 threads and subsequent instances get 15 threads each, but therein lies the problem.
(In reply to Frank Filz from comment #22)
> From the log, I see that there are multiple Ganesha instances. I'm not quite
> sure why the initial instance has 47 threads and subsequent instances get 15
> threads each, but therein lies the problem.

Just noting here that yesterday Frank determined that the appearance of multiple Ganesha instances in the customer log came from restarts due to the 4096 limit being exceeded. There is really only one Ganesha instance (Linux process).

We are reproducing the issue in a lab environment, where it appears that the number of libcephfs threads may be scaling with the number of exports, but apparently only when the exports are processed before ganesha has initialized. Adding new exports after ganesha has initialized does not appear to add to the thread count, but restarting ganesha with existing exports appears to show the scaling issue. Still investigating ...
(In reply to Tom Barron from comment #25)
> (In reply to Frank Filz from comment #22)
> > From the log, I see that there are multiple Ganesha instances. I'm not quite
> > sure why the initial instance has 47 threads and subsequent instances get 15
> > threads each, but therein lies the problem.
>
> Just noting here that yesterday Frank determined that the appearance of
> multiple Ganesha instances in the customer log came from restarts due to the
> 4096 limit being exceeded. There is really only one Ganesha instance (linux
> process).
>
> We are reproducing the issue in a lab environment where it appears that the
> number of libcephfs threads may be scaling with the number of exports, but
> apparently only when the exports are processed before ganesha has
> initialized. Adding new exports after ganesha has initialized does not
> appear to add to the thread count, but restarting ganesha with existing
> exports appears to show the scaling issue.

This is confusing to me. I'm not sure what it means for exports to be processed "before ganesha has initialized". Jeff, do you know?

In any case, each libcephfs instance will spin up a Ceph Client instance. That will result in ~10-20 threads being created. I don't recall whether the Ganesha Ceph FSAL will funnel multiple exports through a single libcephfs handle though (where possible). Jeff can answer that.
No, I don't. libcephfs doesn't really have any awareness of ganesha itself. Probably what you need to do is come up with a reproducer.

FSAL_CEPH doesn't make any effort to share libcephfs handles either. Each export is a separate client.
We can reproduce the thread scale-up issue without having to restart the ganesha container, using a script that adds a unique export client per share.

<---->
$ ./pid-scaling.sh
----------
0 exported shares, 49 threads
Now create 10 shares
Now 10 shares have been created, but they are not exported
----------
0 exported shares, 49 threads
Now export each of the 10 shares in turn
----------
1 exported shares, 63 threads
----------
2 exported shares, 77 threads
----------
3 exported shares, 91 threads
----------
4 exported shares, 105 threads
----------
5 exported shares, 119 threads
----------
6 exported shares, 133 threads
----------
7 exported shares, 147 threads
----------
8 exported shares, 161 threads
----------
9 exported shares, 175 threads
----------
10 exported shares, 189 threads
<---->

There is a bump of 14 threads for each share.

Here is the script:

<---->
#!/bin/sh

function report {
    echo '----------'
    echo "$1 exported shares, $2 threads"
}

function get_num_threads {
    pids=$(sudo podman exec ceph-nfs-pacemaker cat /sys/fs/cgroup/pids/pids.current)
    echo $pids
}

report 0 $(get_num_threads)

echo 'Now create 10 shares'
for i in $(seq 1 10); do
    manila create --name Share${i} nfs 1 >& /dev/null
done
echo 'Now 10 shares have been created, but they are not exported'
report 0 $(get_num_threads)

echo 'Now export each of the 10 shares in turn'
for i in $(seq 1 10); do
    manila access-allow Share${i} ip 1.0.0.${i} >& /dev/null
    manila access-list Share${i} >& /dev/null
    report $i $(get_num_threads)
done

exit 0
<---->

Allowing access to additional CIDRs per share does not seem to affect the scaling. For example, add additional 'manila access-allow' commands in the last loop specifying unique CIDRs, like 'manila access-allow Share${i} ip ${i}.0.0.${i}', and rerun the script after deleting the shares created by the first run. The increment is still 14 threads per share.

Removing a share, or removing access from a share, decrements the thread count equally, so I don't see thread leakage.
Created attachment 1811839 [details] Threads in ganesha container in lab environment with one un-exported share.
Created attachment 1811840 [details] Threads in ganesha container in lab environment with one *exported* share.
Command to gather threads in the ganesha container: sudo podman exec ceph-nfs-pacemaker ps -eT
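To see which thread names account for the per-export bump, the same ps output can be grouped by thread name. A small sketch, again assuming the ceph-nfs-pacemaker container name used above; the specific thread names reported (msgr-worker, safe_timer, and so on) will depend on the libcephfs/Ceph client version in the container:

<---->
# Count threads in the ganesha container grouped by thread name, as
# reported by ps. Comparing the output before and after exporting a share
# shows which Ceph client threads are added per export.
sudo podman exec ceph-nfs-pacemaker ps -eT -o comm= | sort | uniq -c | sort -rn
<---->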
(In reply to Jeff Layton from comment #27)
> No, I don't. libcephfs doesn't really have any awareness of ganesha itself.
> Probably what you need to do is come up with a reproducer.
>
> FSAL_CEPH doesn't make any effort to share libcephfs handles either. Each
> export is a separate client.

I don't think that the processing before or after Ganesha is initialized is actually relevant after all. It looks to me like 14 threads are added when a new share is exported, even after ganesha has initialized. Reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=1987235#c28

With containerized ganesha and the Ceph FSAL we'll hit the default cgroup limit a bit north of 250 exported shares.
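As a rough cross-check of that estimate: using the 49 baseline threads and 14-thread-per-export increment measured in comment #28 against the default 4096 pids limit, and ignoring any extra threads created under actual client load, the ceiling works out to just under 290 exports:

<---->
# Back-of-the-envelope ceiling on exported shares under the default limit,
# using the numbers measured in comment #28.
baseline=49        # threads with 0 exported shares
per_export=14      # additional threads per exported share
pids_limit=4096    # default container pids limit
echo $(( (pids_limit - baseline) / per_export ))   # prints 289
<---->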
Catching up on comments.

Seems that bumping the PidsLimit attribute for the nfs-ganesha-pacemaker container is the best we can do at this point.

According to the Podman docs, if we set the PIDsLimit to 0 [0], we allow any number of threads to be run inside the container... and therefore we don't need to fix any specific limit for the container. Worst case scenario, we will have 65535 exports [1]. Actually, it's less than that, since we leave the first 1000 free. Would this have a negative impact on overall system performance for the controllers? Would this impact other containers' health? What do you think?

Giulio, if this sounds like a good solution, we would need someone in Ceph Ansible to help us with this.

[0] http://docs.podman.io/en/latest/markdown/podman-run.1.html#pids-limit-limit
[1] https://github.com/ffilz/nfs-ganesha/blob/next/src/config_samples/export.txt#L102-L104
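For reference, a hedged sketch of how this looks at the podman level, using the ceph-nfs-pacemaker container name from the reproducer in comment #28. The inspect field name follows the docker-compatible inspect output, and the semantics of --pids-limit=0 are taken from the Podman docs cited in [0]; both should be double-checked against the Podman version actually shipped:

<---->
# PidsLimit currently configured for the ganesha container.
sudo podman inspect --format '{{.HostConfig.PidsLimit}}' ceph-nfs-pacemaker

# A new limit would be carried on the run command when the container is
# (re)created, e.g. --pids-limit=0 for "unlimited" per the cited docs.
# The exact run command depends on how ceph-ansible/pacemaker creates the
# container; the flag here is only illustrative.
<---->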
(In reply to Victoria Martinez de la Cruz from comment #39)
> Catching up on comments
>
> Seems that bumping the PidsLimit attribute for the nfs-ganesha-pacemaker
> container is the best we can do at this point.
>
> According to Podman docs, if we set the PIDsLimit to 0 [0], we allow any
> number of threads to be run inside the container... and therefore we don't
> need to fix any specific limit to the container. Worst case scenario, we
> will have 65535 exports [0].

@youngcheol ^^ --pids-limit does indeed seem better than TasksMax in the unit; thanks Victoria for pointing it out.

> Actually, is less than that since we leave the
> first 1000 free. Would this have a negative impact on overall system
> performance for the controllers? Would this impact other containers health?
> What do you think?

Ganesha is colocated with MONs/MGRs/MDSs in the default Director deployment config, so these are all good questions. Jeff, do we expect as many threads to be created by MDSs and/or MONs for each client spawned by libcephfs?
(In reply to Giulio Fidente from comment #40)
>
> Ganesha is colocated with MONs/MGRs/MDSs in the default deployment Director
> config so these are all good questions; Jeff do we expect as many threads
> created by MDSs and/or MONs for each (client) spawned by libcephfs?

I'm less well-versed in the server-side internals, but in general, no. The MON and MDS should spawn threads in response to the load placed on them. They don't spin up extra threads to deal with idle clients.
So, conclusions at this point for this case:

1. Set the default to a value that aligns with the number of shares that production environments have on average. This needs to be fixed in ceph-ansible; will open a BZ for them.
2. Provide a way to override this value through THT (see the verification sketch below).
3. Update https://access.redhat.com/articles/1436373 to include this new limit.
4. Keep this BZ as a tracker, TestOnly for Manila.
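Whatever mechanism ceph-ansible and THT end up using to carry the override, the effective limit can be checked on a controller after deployment. A minimal sketch, assuming the same cgroup v1 pids controller layout that the reproducer script in comment #28 reads, and the ceph-nfs-pacemaker container name used there:

<---->
# Limit actually applied to the ganesha container's pids cgroup
# ("max" means unlimited).
sudo podman exec ceph-nfs-pacemaker cat /sys/fs/cgroup/pids/pids.max

# Current thread count, to compare against the limit.
sudo podman exec ceph-nfs-pacemaker cat /sys/fs/cgroup/pids/pids.current
<---->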