Description of problem (please be as detailed as possible and provide log snippets):
Log collection for some ceph commands failed with a timeout:
fork system call failed: Resource temporarily unavailable

Version of all relevant components (if applicable):
OCP version: 4.7.0-0.nightly-2021-05-01-081439
OCS version: 4.7.0-364.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
Yes, it's a regression.

Steps to Reproduce:
1. Deploy OCP and OCS 4.7
2. Run some I/O
3. Collect must-gather logs

Actual results:
runtime/cgo: pthread_create failed: Resource temporarily unavailable
runtime/cgo: pthread_create failed: Resource temporarily unavailable

Expected results:
ceph command output should be collected

Additional info:
The issue here is that the CRI-O default PID limit is 1024, and I am assuming that the nodes on which the MG was running had already had all of their PIDs allocated. To fix this, one has to increase the PID limit above the default of 1024. The steps are written here: https://access.redhat.com/solutions/5597061 Once done, the nodes will be updated one by one.

Here's the output from a node BEFORE applying the fix:

sh-4.4# crio config | grep pids_limit
time="2021-05-04T07:27:10Z" level=info msg="Starting CRI-O, version: 1.20.2-10.rhaos4.7.gitfc8b9e9.el8, git: ()"
level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
pids_limit = 1024

And here's AFTER applying the fix:

sh-4.4# crio config | grep pids_limit
INFO[0000] Starting CRI-O, version: 1.20.2-10.rhaos4.7.gitfc8b9e9.el8, git: ()
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
pids_limit = 2048
sh-4.4#
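For reference, the KB article above raises the limit through a ContainerRuntimeConfig resource. A minimal sketch of what that looks like (the resource name and the worker-pool selector label here are assumptions for illustration; the 2048 value matches the output above):

```yaml
# Hypothetical sketch, per the KB article above: raise the CRI-O pids_limit
# on worker nodes via a ContainerRuntimeConfig. The metadata name and the
# pool selector label are placeholders; adjust them for your cluster.
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    pidsLimit: 2048   # raised from the 1024 default shown above
```

Applying this rolls the change out through the Machine Config Operator, which is why the nodes update one by one.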
Here's the file on the server: http://magna012.ceph.redhat.com/bz/bz_1956285/quay-io-ocs-dev-ocs-must-gather-sha256-2dc20d4b2e8fc7000b2c7d7a31646b60f137d777e51868cf1be04ac86fc9bc15/ Try searching for runtime/cgo; the issue was the one I explained above. Moving it to POST, please comment so that I can change states if needed.
Marking as NEW again, since this issue is fixed with the above patch when using the quay.io/ocs-dev/ocs-must-gather image, but still fails when using the quay.io/rhceph-dev/ocs-must-gather:latest-4.7 image. @branto, can you please shed some light on the behavior of the http://quay.io/rhceph-dev repo and why the image from that repo fails to give the same result as the downstream build? Thanks
Hi Boris, by the patch I meant the fix that I applied from here: https://access.redhat.com/solutions/5597061 It increases the PID limit so that the MG script can run, but the issue still persists with the downstream image. That is why I asked whether there's something happening on your end, since after applying the fix above, the upstream image no longer hits this issue.
Where did you apply the fix? In the Dockerfile? Somewhere else?
In the cluster itself; please take a look at the fix: https://access.redhat.com/solutions/5597061
Did you apply it manually? How is this supposed to be automated?
I applied it manually, but if it keeps happening, the platform team will have to automate it or increase the default number of PIDs.
OK, it looks like this requires some i/o to be run to hit this, right? The article describes how to fix this in that case. I presume the i/o takes up some pids and we are hitting a default pid limit. There was no code change in the upstream image right? You only applied the fix manually right? I presume we might have been hitting even the increased pid limit with the downstream image then (depending on how much i/o was occurring). Is this reproducible regularly? Should we try to automate it in some way?
(In reply to Boris Ranto from comment #13)
> OK, it looks like this requires some i/o to be run to hit this, right? The
> article describes how to fix this in that case. I presume the i/o takes up
> some pids and we are hitting a default pid limit.
>
> There was no code change in the upstream image right? You only applied the
> fix manually right? I presume we might have been hitting even the increased
> pid limit with the downstream image then (depending on how much i/o was
> occurring).

Yes, there was no code change, but this is occurring just in Upstream and not in Downstream. This was filed because must-gather runs some tasks in the background. My concern was whether there's something happening in the Upstream repo which is causing this change.

> Is this reproducible regularly? Should we try to automate it in some way?

Yes, it's reproducible all the time; just run must-gather's upstream branch and it will happen again.
Pratik, are you able to apply the workaround suggested by Rajat in https://bugzilla.redhat.com/show_bug.cgi?id=1956285#c4? Also, is this reproducible in 4.8?

Rajat, what is the customer impact if this is present only upstream? Do the customers use rhceph-dev to run must-gather downstream?
We have a workaround, not urgent enough to fix in 4.8
After the offline discussion (Elad, Pratik, Parth, Raz, Oded), removing the blocker flag. Here is the summary:
- The WA is already documented: https://access.redhat.com/solutions/5597061
- Work on a permanent solution, one of or a combination of:
  - decrease the number of commands being fired at once, which will probably cause MG to take more time
  - list the failed commands and re-collect them at the end
- The permanent solution we can track with https://bugzilla.redhat.com/show_bug.cgi?id=1956285 for 4.8.z or 4.9.
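The first option above (limiting how many commands fork at once) can be sketched in shell. This is a hedged illustration, not the actual must-gather code: `MAX_JOBS`, `run_limited`, and the echo stand-ins are all hypothetical, and the real cap would need tuning against the node's pids_limit:

```shell
#!/bin/bash
# Hypothetical sketch: throttle background collection commands so that at
# most MAX_JOBS run concurrently, instead of forking everything at once
# and tripping the CRI-O pids_limit.
MAX_JOBS=4                 # assumed cap; tune against the node's pids_limit
OUT=$(mktemp)              # stand-in for the must-gather output directory

run_limited() {
    # Wait until a job slot frees up before forking the next command.
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 0.05
    done
    "$@" >>"$OUT" &
}

# Stand-ins for the real "ceph ..." collection commands.
for i in $(seq 1 10); do
    run_limited echo "collected output for command $i"
done
wait                       # let the remaining background jobs finish
```

The trade-off noted above is visible here: with a small `MAX_JOBS`, total runtime grows roughly linearly with the number of queued commands, which is why this would make MG take more time.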
Fix should be available in the latest ODF builds
Bug reconstructed: fork system call failed.

SetUp:
Provider: BareMetal
OCP Version: 4.9.0-0.nightly-2021-10-16-173626
ODF Version: 4.9.0-192.ci
LSO Version: local-storage-operator.4.9.0-202110121402

Test Procedure:
1. Run must-gather command:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
[must-gather-hlnss] POD collecting command output for: ceph osd blocked-by
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph osd blacklist ls
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pg dump
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pg stat
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pool autoscale-status
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph progress
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: Interrupted system call

2. Check the MG directory. Some files do not exist in the MG directory:
['ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem', 'ceph_mds_stat', 'ceph_osd_crush_dump', 'ceph_osd_crush_show-tunables', 'ceph_osd_crush_weight-set_dump', 'ceph_osd_df_tree', 'ceph_osd_utilization']
['ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem_--format_json-pretty', 'ceph_mgr_services_--format_json-pretty', 'ceph_osd_crush_dump_--format_json-pretty', 'ceph_osd_crush_weight-set_dump_--format_json-pretty', 'ceph_osd_crush_weight-set_ls_--format_json-pretty', 'ceph_osd_df_tree_--format_json-pretty', 'ceph_osd_dump_--format_json-pretty', 'ceph_osd_tree_--format_json-pretty']
As discussed here https://chat.google.com/room/AAAAsMRYD8Y/vAhyFFGPVnY, this is not seen with the official image but with the rhceph-dev image. Rewant is trying to find the difference, but not a 4.9 blocker. Moving it out, please revert back if required.
As this will be available with the next 4.9 downstream build, moving it back to 4.9
Bug fixed.

SetUp:
Provider: BareMetal
OCP Version: 4.9.0-0.nightly-2021-11-10-215111
ODF Version: 4.9.0-233.ci
LSO Version: local-storage-operator.4.9.0-202111020858

Test Procedure:
1. Run must-gather command:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
2. Check the MG log -> there is no "fork system call failed" log
3. Check that all relevant files exist in the MG directory
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086