Bug 1956285 - [must-gather] log collection for some ceph cmd failed with timeout: fork system call failed: Resource temporarily unavailable
Summary: [must-gather] log collection for some ceph cmd failed with timeout: fork syst...
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat
Component: must-gather
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: RAJAT SINGH
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-03 11:52 UTC by Pratik Surve
Modified: 2021-05-17 10:25 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Pratik Surve 2021-05-03 11:52:18 UTC
Description of problem (please be as detailed as possible and provide log snippets):

log collection for some ceph cmd failed with timeout: fork system call failed: Resource temporarily unavailable

Version of all relevant components (if applicable):

OCP version:- 4.7.0-0.nightly-2021-05-01-081439
OCS version:- 4.7.0-364.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Yes, it's a regression.

Steps to Reproduce:
1. Deploy OCP, OCS 4.7
2. Run some I/O
3. Collect must-gather logs


Actual results:
runtime/cgo: runtime/cgo: pthread_create failed: Resource temporarily unavailable
pthread_create failed: Resource temporarily unavailable

Expected results:
ceph cmd output should be collected

Additional info:

Comment 4 RAJAT SINGH 2021-05-04 07:56:12 UTC
The issue here is that the CRI-O default PID limit is 1024, and I am assuming that the nodes on which the MG was running had already allocated all of those PIDs. To fix this, one has to increase the PID limit above the default of 1024. The steps are written here: https://access.redhat.com/solutions/5597061
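
For reference, one way to raise the CRI-O pids_limit cluster-wide on OCP 4.x is a ContainerRuntimeConfig handled by the Machine Config Operator. A minimal sketch, assuming the default worker MachineConfigPool label (the CR name and the 2048 value are only examples, not taken from the KB article):

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: increase-pids-limit          # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      # assumes the default label on the worker MachineConfigPool
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    pidsLimit: 2048                  # new per-container PID limit
EOF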

Once done, the nodes will be updated one by one.
Here's the output from a node before applying the fix

sh-4.4# crio config | grep pids_limit
time="2021-05-04T07:27:10Z" level=info msg="Starting CRI-O, version: 1.20.2-10.rhaos4.7.gitfc8b9e9.el8, git: ()"
level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
pids_limit = 1024

Here's the output AFTER applying the fix:

sh-4.4# crio config | grep pids_limit
INFO[0000] Starting CRI-O, version: 1.20.2-10.rhaos4.7.gitfc8b9e9.el8, git: ()
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
pids_limit = 2048
sh-4.4#
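
Side note: a quick way to confirm on the node that a container cgroup is actually bumping into this limit (a sketch assuming the cgroup v1 layout used by RHCOS; the wildcarded kubepods.slice paths are illustrative and depend on the pod's QoS class):

# from a debug shell on the node (oc debug node/<node-name>; chroot /host)
cat /sys/fs/cgroup/pids/kubepods.slice/*/*/crio-*.scope/pids.current
cat /sys/fs/cgroup/pids/kubepods.slice/*/*/crio-*.scope/pids.max
# pids.max should match the configured CRI-O limit
crio config | grep pids_limit

A container whose pids.current is at or near pids.max will fail fork()/pthread_create with EAGAIN, which is exactly the "Resource temporarily unavailable" error above.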

Comment 5 RAJAT SINGH 2021-05-04 08:30:49 UTC
Here's the file on the server http://magna012.ceph.redhat.com/bz/bz_1956285/quay-io-ocs-dev-ocs-must-gather-sha256-2dc20d4b2e8fc7000b2c7d7a31646b60f137d777e51868cf1be04ac86fc9bc15/
Try searching for runtime/cgo. The issue was the one I explained above. Moving it to POST; please comment so that I can change states if needed.

Comment 6 RAJAT SINGH 2021-05-04 09:39:12 UTC
Marking as NEW again, since this issue is fixed with the above patch when using the quay.io/ocs-dev/ocs-must-gather image, but it fails when using the quay.io/rhceph-dev/ocs-must-gather:latest-4.7 image.
@branto@redhat.com can you please shed some light on the behavior of the http://quay.io/rhceph-dev repo and why the image from that repo fails to give the same result as the downstream build?
Thanks

Comment 8 RAJAT SINGH 2021-05-06 12:12:22 UTC
Hi Boris, by the patch I meant the fix that I applied from here: https://access.redhat.com/solutions/5597061. It increases the PID limit so that the MG script can run, but the issue still persists with the downstream image. That is why I asked whether there's something happening on your end, since after applying the fix above, the upstream image no longer hits this issue.
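
For a side-by-side comparison, both builds can be run explicitly with the same command (a sketch; the two images are the ones referenced in comment 6):

oc adm must-gather --image=quay.io/ocs-dev/ocs-must-gather
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7
# then check each resulting must-gather.local.* directory for the failure signature
grep -r "Resource temporarily unavailable" must-gather.local.*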

Comment 9 Boris Ranto 2021-05-10 08:40:16 UTC
Where did you apply the fix? In the Dockerfile? Somewhere else?

Comment 10 RAJAT SINGH 2021-05-17 09:32:07 UTC
In the cluster itself. Please take a look at the fix itself:
https://access.redhat.com/solutions/5597061

Comment 11 Boris Ranto 2021-05-17 10:25:44 UTC
Did you apply it manually? How is this supposed to be automated?

