Bug 1956285
Summary: | [must-gather] log collection for some ceph cmd failed with timeout: fork system call failed: Resource temporarily unavailable | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve>
Component: | must-gather | Assignee: | Rewant <resoni>
Status: | CLOSED ERRATA | QA Contact: | Oded <oviner>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.7 | CC: | branto, ebenahar, kramdoss, muagarwa, ocs-bugs, odf-bz-bot, resoni, sabose, tdesala
Target Milestone: | --- | Keywords: | AutomationBackLog, Regression
Target Release: | ODF 4.9.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-12-13 17:44:30 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Pratik Surve
2021-05-03 11:52:18 UTC
The issue here is that the CRI-O default PID limit is 1024, and I am assuming that the nodes on which the must-gather was running had already allocated all of their PIDs. To fix this, one has to raise the PID limit above the default maximum of 1024; the steps are written here: https://access.redhat.com/solutions/5597061. Once done, the nodes will be updated one by one.

Here's the output from a node before applying the fix:

    sh-4.4# crio config | grep pids_limit
    time="2021-05-04T07:27:10Z" level=info msg="Starting CRI-O, version: 1.20.2-10.rhaos4.7.gitfc8b9e9.el8, git: ()"
    level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
    pids_limit = 1024

And here's after applying the fix:

    sh-4.4# crio config | grep pids_limit
    INFO[0000] Starting CRI-O, version: 1.20.2-10.rhaos4.7.gitfc8b9e9.el8, git: ()
    INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
    pids_limit = 2048
    sh-4.4#

Here's the file on the server: http://magna012.ceph.redhat.com/bz/bz_1956285/quay-io-ocs-dev-ocs-must-gather-sha256-2dc20d4b2e8fc7000b2c7d7a31646b60f137d777e51868cf1be04ac86fc9bc15/ (try searching for runtime/cgo). The issue was the one I explained above. Moving it to POST; please comment so that I can change states if needed.
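The KCS article linked above describes the supported procedure in full; roughly, it comes down to applying a ContainerRuntimeConfig that raises the CRI-O pids_limit. A minimal sketch follows; the MachineConfigPool selector label and the 2048 value are illustrative assumptions, so follow the article for the exact steps:

```bash
# Sketch only; see https://access.redhat.com/solutions/5597061 for the supported procedure.
# Raises the CRI-O pids_limit for worker nodes. The pool selector label and the 2048 value
# are assumptions for illustration.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: increase-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    pidsLimit: 2048
EOF

# The Machine Config Operator then rolls the change out node by node; verify on a node with:
#   crio config | grep pids_limit
```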
Marking as NEW again, since this issue is fixed with the above patch when using the quay.io/ocs-dev/ocs-must-gather image, but still fails when using the quay.io/rhceph-dev/ocs-must-gather:latest-4.7 image.

@branto can you please shed some light on the behavior of the http://quay.io/rhceph-dev repo and why the image from that repo fails to give the same result as the downstream build? Thanks.

Hi Boris, by the patch I meant the fix that I applied from https://access.redhat.com/solutions/5597061. It increases the PID limit so that the must-gather script can run, but the issue still persists with the downstream image. That is why I asked whether something is happening on your end, since after applying the fix above, the upstream image no longer hits this issue.

Where did you apply the fix? In the Dockerfile? Somewhere else?

In the cluster itself; please take a look at the fix itself: https://access.redhat.com/solutions/5597061

Did you apply it manually? How is this supposed to be automated?

I applied it manually, but if it keeps happening, the platform team will have to automate it or increase the default number of PIDs.

OK, it looks like this requires some I/O to be running to hit this, right? The article describes how to fix it in that case. I presume the I/O takes up some PIDs and we are hitting the default PID limit. There was no code change in the upstream image, right? You only applied the fix manually? I presume we might have been hitting even the increased PID limit with the downstream image then (depending on how much I/O was occurring). Is this reproducible regularly? Should we try to automate it in some way?

(In reply to Boris Ranto from comment #13)
> OK, it looks like this requires some i/o to be run to hit this, right? The
> article describes how to fix this in that case. I presume the i/o takes up
> some pids and we are hitting a default pid limit.
>
> There was no code change in the upstream image right? You only applied the
> fix manually right? I presume we might have been hitting even the increased
> pid limit with the downstream image then (depending on how much i/o was
> occurring).

Yes, there was no code change, but this is occurring only upstream and not downstream. This was filed under must-gather; it runs some tasks in the background. My concern was whether something happening in the upstream repo is causing this change.

> Is this reproducible regularly? Should we try to automate it in some way?

Yes, it's reproducible all the time; just run must-gather from the upstream branch and it will happen again.

Pratik, are you able to apply the workaround suggested by Rajat in https://bugzilla.redhat.com/show_bug.cgi?id=1956285#c4? Also, is this reproducible in 4.8?

Rajat, what is the customer impact if this is present only upstream? Do customers use rhceph-dev to run must-gather downstream? We have a workaround; not urgent enough to fix in 4.8.

After the offline discussion (Elad, Pratik, Parth, Raz, Oded), removing the blocker flag. Here is the summary:
- The workaround is already documented: https://access.redhat.com/solutions/5597061
- Work on a permanent solution, one of or a combination of (a rough sketch of both ideas follows below):
  - decrease the number of commands being fired at once, which will probably cause must-gather to take more time
  - list the failed commands and re-collect them at the end
- This permanent solution we can track with https://bugzilla.redhat.com/show_bug.cgi?id=1956285 for 4.8.z or 4.9.
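A rough sketch of what those two ideas could look like in a gather script. This is not the actual gather_ceph_resources code; MAX_JOBS, collect_cmd, ceph_commands.txt, and the output paths are hypothetical:

```bash
#!/bin/bash
# Sketch only: the file names, function names, and limits here are assumptions,
# not the real gather_ceph_resources implementation.

MAX_JOBS=8          # cap concurrent collections instead of firing everything at once
OUT_DIR=must_gather
mkdir -p "$OUT_DIR"
: > "$OUT_DIR/failed_cmds.txt"

collect_cmd() {
    local cmd="$1"
    # store "ceph osd df tree" as ceph_osd_df_tree, mirroring the naming seen above
    local out="$OUT_DIR/${cmd// /_}"
    # $cmd is intentionally unquoted so the command string is word-split
    if ! timeout 120 $cmd > "$out" 2>&1; then
        echo "$cmd" >> "$OUT_DIR/failed_cmds.txt"
    fi
}

while read -r cmd; do
    # throttle: wait for a free slot before forking another background collection
    while (( $(jobs -rp | wc -l) >= MAX_JOBS )); do
        wait -n
    done
    collect_cmd "$cmd" &
done < ceph_commands.txt
wait

# second pass: re-collect the failed commands serially at the end
cp "$OUT_DIR/failed_cmds.txt" "$OUT_DIR/retry_cmds.txt"
while read -r cmd; do
    collect_cmd "$cmd"
done < "$OUT_DIR/retry_cmds.txt"
```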
Fix should be available in the latest ODF builds.

Bug reproduced: fork system call failed.

SetUp:
Provider: BareMetal
OCP Version: 4.9.0-0.nightly-2021-10-16-173626
ODF Version: 4.9.0-192.ci
LSO Version: local-storage-operator.4.9.0-202110121402

Test Procedure:
1. Run must-gather command:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
[must-gather-hlnss] POD collecting command output for: ceph osd blocked-by
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph osd blacklist ls
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pg dump
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pg stat
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pool autoscale-status
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph progress
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: Interrupted system call
2. Check the must-gather directory. Some files do not exist in the mg directory:
['ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem', 'ceph_mds_stat', 'ceph_osd_crush_dump', 'ceph_osd_crush_show-tunables', 'ceph_osd_crush_weight-set_dump', 'ceph_osd_df_tree', 'ceph_osd_utilization']
['ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem_--format_json-pretty', 'ceph_mgr_services_--format_json-pretty', 'ceph_osd_crush_dump_--format_json-pretty', 'ceph_osd_crush_weight-set_dump_--format_json-pretty', 'ceph_osd_crush_weight-set_ls_--format_json-pretty', 'ceph_osd_df_tree_--format_json-pretty', 'ceph_osd_dump_--format_json-pretty', 'ceph_osd_tree_--format_json-pretty']

As discussed here https://chat.google.com/room/AAAAsMRYD8Y/vAhyFFGPVnY, this is not seen with the official image but with the rhceph-dev image. Rewant is trying to find the difference, but it is not a 4.9 blocker. Moving it out; please revert if required.

As this will be available with the next 4.9 downstream build, moving it back to 4.9.

Bug fixed.

SetUp:
Provider: BareMetal
OCP Version: 4.9.0-0.nightly-2021-11-10-215111
ODF Version: 4.9.0-233.ci
LSO Version: local-storage-operator.4.9.0-202111020858

Test Procedure:
1. Run must-gather command:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
2. Check the must-gather log: there is no "fork system call failed" log.
3. Check that all relevant files exist in the must-gather directory.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086
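For reference, a quick way to script QA's verification steps 2 and 3 above. The directory argument and the list of expected files are illustrative assumptions taken from the missing files reported earlier in this bug, not an official check:

```bash
#!/bin/bash
# Sketch of the verification steps above; the expected file names come from the
# lists of missing files reported in this bug and are illustrative only.
MG_DIR="${1:?usage: $0 <must-gather output dir>}"

# Step 2: the gather logs should contain no fork failures
if grep -rE "fork system call failed|fork: retry" "$MG_DIR" >/dev/null 2>&1; then
    echo "fork failures still present in $MG_DIR"
else
    echo "no fork failures found"
fi

# Step 3: the ceph command outputs that were missing earlier should now exist
for f in ceph_mds_stat ceph_osd_crush_dump ceph_osd_df_tree ceph_osd_utilization \
         ceph_osd_dump_--format_json-pretty ceph_osd_tree_--format_json-pretty; do
    if find "$MG_DIR" -name "$f" -print -quit | grep -q .; then
        echo "OK: $f"
    else
        echo "MISSING: $f"
    fi
done
```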