From the log, I see that there are multiple Ganesha instances. I'm not quite sure why the initial instance has 47 threads and subsequent instances get 15 threads each, but therein lies the problem.
(In reply to Frank Filz from comment #22)
> From the log, I see that there are multiple Ganesha instances. I'm not quite
> sure why the initial instance has 47 threads and subsequent instances get 15
> threads each, but therein lies the problem.

Just noting here that yesterday Frank determined that the appearance of multiple Ganesha instances in the customer log came from restarts due to the 4096 limit being exceeded. There is really only one Ganesha instance (Linux process).

We are reproducing the issue in a lab environment, where it appears that the number of libcephfs threads may be scaling with the number of exports, but apparently only when the exports are processed before ganesha has initialized. Adding new exports after ganesha has initialized does not appear to add to the thread count, but restarting ganesha with existing exports appears to show the scaling issue. Still investigating ...
(In reply to Tom Barron from comment #25)
> (In reply to Frank Filz from comment #22)
> > From the log, I see that there are multiple Ganesha instances. I'm not quite
> > sure why the initial instance has 47 threads and subsequent instances get 15
> > threads each, but therein lies the problem.
>
> Just noting here that yesterday Frank determined that the appearance of
> multiple Ganesha instances in the customer log came from restarts due to the
> 4096 limit being exceeded. There is really only one Ganesha instance (linux
> process).
>
> We are reproducing the issue in a lab environment where it appears that the
> number of libcephfs threads may be scaling with the number of exports, but
> apparently only when the exports are processed before ganesha has
> initialized. Adding new exports after ganesha has initialized does not
> appear to add to the thread count, but restarting ganesha with existing
> exports appears to show the scaling issue.

This is confusing to me. I'm not sure what it means for exports to be processed "before ganesha has initialized". Jeff, do you know?

In any case, each libcephfs instance will spin up a Ceph Client instance. That will result in ~10-20 threads being created. I don't recall whether the Ganesha Ceph FSAL will funnel multiple exports through a single libcephfs handle though (where possible). Jeff can answer that.
No, I don't. libcephfs doesn't really have any awareness of ganesha itself. Probably what you need to do is come up with a reproducer.

FSAL_CEPH doesn't make any effort to share libcephfs handles either. Each export is a separate client.
We can reproduce the thread scale-up issue without having to restart the ganesha container, using a script that adds a unique export client per share.

<---->
$ ./pid-scaling.sh
----------
0 exported shares, 49 threads
Now create 10 shares
Now 10 shares have been created, but they are not exported
----------
0 exported shares, 49 threads
Now export each of the 10 shares in turn
----------
1 exported shares, 63 threads
----------
2 exported shares, 77 threads
----------
3 exported shares, 91 threads
----------
4 exported shares, 105 threads
----------
5 exported shares, 119 threads
----------
6 exported shares, 133 threads
----------
7 exported shares, 147 threads
----------
8 exported shares, 161 threads
----------
9 exported shares, 175 threads
----------
10 exported shares, 189 threads
<---->

There is a bump of 14 threads for each share.

Here is the script:

<---->
#!/bin/sh

function report {
    echo '----------'
    echo "$1 exported shares, $2 threads"
}

function get_num_threads {
    pids=$(sudo podman exec ceph-nfs-pacemaker cat /sys/fs/cgroup/pids/pids.current)
    echo $pids
}

report 0 $(get_num_threads)

echo 'Now create 10 shares'
for i in $(seq 1 10); do
    manila create --name Share${i} nfs 1 >& /dev/null
done
echo 'Now 10 shares have been created, but they are not exported'
report 0 $(get_num_threads)

echo 'Now export each of the 10 shares in turn'
for i in $(seq 1 10); do
    manila access-allow Share${i} ip 1.0.0.${i} >& /dev/null
    manila access-list Share${i} >& /dev/null
    report $i $(get_num_threads)
done

exit 0
<---->

Allowing access to additional CIDRs per share does not seem to affect the scaling. For example, add additional 'manila access-allow' commands in the last loop specifying unique CIDRs, like 'manila access-allow Share${i} ip ${i}.0.0.${i}', and rerun the script after deleting the shares created by the first run. The increment is still 14 threads per share.

Removing a share, or removing access from a share, decrements the thread count equally, so I don't see thread leakage.
Created attachment 1811839 [details] Threads in ganesha container in lab environment with one un-exported share.
Created attachment 1811840 [details] Threads in ganesha container in lab environment with one *exported* share.
Command to gather threads in the ganesha container: sudo podman exec ceph-nfs-pacemaker ps -eT
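To see which thread names account for the per-export bump, the same ps output can be grouped by thread name. A small sketch, again assuming the ceph-nfs-pacemaker container name used above; the specific thread names reported (msgr-worker, safe_timer, and so on) will depend on the libcephfs/Ceph client version in the container:

<---->
# Count threads in the ganesha container grouped by thread name, as
# reported by ps. Comparing the output before and after exporting a share
# shows which Ceph client threads are added per export.
sudo podman exec ceph-nfs-pacemaker ps -eT -o comm= | sort | uniq -c | sort -rn
<---->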
(In reply to Jeff Layton from comment #27)
> No, I don't. libcephfs doesn't really have any awareness of ganesha itself.
> Probably what you need to do is come up with a reproducer.
>
> FSAL_CEPH doesn't make any effort to share libcephfs handles either. Each
> export is a separate client.

I don't think that the processing before or after Ganesha is initialized is actually relevant after all. It looks to me like 14 threads are added when a new share is exported, even after ganesha has initialized. Reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=1987235#c28

With containerized ganesha and the Ceph FSAL we'll hit the default cgroup limit a bit north of 250 exported shares.
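As a rough cross-check of that estimate: using the 49 baseline threads and 14-thread-per-export increment measured in comment #28 against the default 4096 pids limit, and ignoring any extra threads created under actual client load, the ceiling works out to just under 290 exports:

<---->
# Back-of-the-envelope ceiling on exported shares under the default limit,
# using the numbers measured in comment #28.
baseline=49        # threads with 0 exported shares
per_export=14      # additional threads per exported share
pids_limit=4096    # default container pids limit
echo $(( (pids_limit - baseline) / per_export ))   # prints 289
<---->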
Catching up on comments.

Seems that bumping the PidsLimit attribute for the nfs-ganesha-pacemaker container is the best we can do at this point.

According to the Podman docs, if we set the PIDsLimit to 0 [0], we allow any number of threads to be run inside the container... and therefore we don't need to fix any specific limit for the container. Worst case scenario, we will have 65535 exports [1]. Actually, it's less than that, since we leave the first 1000 free. Would this have a negative impact on overall system performance for the controllers? Would this impact other containers' health? What do you think?

Giulio, if this sounds like a good solution, we would need someone in Ceph Ansible to help us with this.

[0] http://docs.podman.io/en/latest/markdown/podman-run.1.html#pids-limit-limit
[1] https://github.com/ffilz/nfs-ganesha/blob/next/src/config_samples/export.txt#L102-L104
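For reference, a hedged sketch of how this looks at the podman level, using the ceph-nfs-pacemaker container name from the reproducer in comment #28. The inspect field name follows the docker-compatible inspect output, and the semantics of --pids-limit=0 are taken from the Podman docs cited in [0]; both should be double-checked against the Podman version actually shipped:

<---->
# PidsLimit currently configured for the ganesha container.
sudo podman inspect --format '{{.HostConfig.PidsLimit}}' ceph-nfs-pacemaker

# A new limit would be carried on the run command when the container is
# (re)created, e.g. --pids-limit=0 for "unlimited" per the cited docs.
# The exact run command depends on how ceph-ansible/pacemaker creates the
# container; the flag here is only illustrative.
<---->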
(In reply to Victoria Martinez de la Cruz from comment #39)
> Catching up on comments
>
> Seems that bumping the PidsLimit attribute for the nfs-ganesha-pacemaker
> container is the best we can do at this point.
>
> According to Podman docs, if we set the PIDsLimit to 0 [0], we allow any
> number of threads to be run inside the container... and therefore we don't
> need to fix any specific limit to the container. Worst case scenario, we
> will have 65535 exports [0].

@youngcheol ^^ --pids-limit does indeed seem better than TasksMax in the unit; thanks Victoria for pointing it out.

> Actually, is less than that since we leave the
> first 1000 free. Would this have a negative impact on overall system
> performance for the controllers? Would this impact other containers health?
> What do you think?

Ganesha is colocated with MONs/MGRs/MDSs in the default Director deployment config, so these are all good questions. Jeff, do we expect as many threads to be created by MDSs and/or MONs for each client spawned by libcephfs?
(In reply to Giulio Fidente from comment #40)
>
> Ganesha is colocated with MONs/MGRs/MDSs in the default deployment Director
> config so these are all good questions; Jeff do we expect as many threads
> created by MDSs and/or MONs for each (client) spawned by libcephfs?

I'm less well-versed in the server-side internals, but in general, no. The MON and MDS should spawn threads in response to the load placed on them. They don't spin up extra threads to deal with idle clients.
So, conclusions at this point for this case:

1. Set the default to a value that aligns with the number of shares that production environments have on average. This needs to be fixed in ceph-ansible; will open a BZ for them.
2. Provide a way to override this value through THT (see the verification sketch below).
3. Update https://access.redhat.com/articles/1436373 to include this new limit.
4. Keep this BZ as a tracker, TestOnly for Manila.
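Whatever mechanism ceph-ansible and THT end up using to carry the override, the effective limit can be checked on a controller after deployment. A minimal sketch, assuming the same cgroup v1 pids controller layout that the reproducer script in comment #28 reads, and the ceph-nfs-pacemaker container name used there:

<---->
# Limit actually applied to the ganesha container's pids cgroup
# ("max" means unlimited).
sudo podman exec ceph-nfs-pacemaker cat /sys/fs/cgroup/pids/pids.max

# Current thread count, to compare against the limit.
sudo podman exec ceph-nfs-pacemaker cat /sys/fs/cgroup/pids/pids.current
<---->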