Bug 1956285
Summary: | [must-gather] log collection for some ceph cmd failed with timeout: fork system call failed: Resource temporarily unavailable | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve>
Component: | must-gather | Assignee: | Rewant <resoni>
Status: | CLOSED ERRATA | QA Contact: | Oded <oviner>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.7 | CC: | branto, ebenahar, kramdoss, muagarwa, ocs-bugs, odf-bz-bot, resoni, sabose, tdesala
Target Milestone: | --- | Keywords: | AutomationBackLog, Regression
Target Release: | ODF 4.9.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-12-13 17:44:30 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Pratik Surve
2021-05-03 11:52:18 UTC
The issue here is that the CRI-O default PID limit is 1024, and I am assuming that the nodes on which the must-gather was running had already allocated all of their PIDs. To fix this, one has to raise the PID limit above the default maximum of 1024; the steps are written here: https://access.redhat.com/solutions/5597061. Once done, the nodes will be updated one by one.

Here's the output from a node before applying the fix:

    sh-4.4# crio config | grep pids_limit
    time="2021-05-04T07:27:10Z" level=info msg="Starting CRI-O, version: 1.20.2-10.rhaos4.7.gitfc8b9e9.el8, git: ()"
    level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
    pids_limit = 1024

And here's after applying the fix:

    sh-4.4# crio config | grep pids_limit
    INFO[0000] Starting CRI-O, version: 1.20.2-10.rhaos4.7.gitfc8b9e9.el8, git: ()
    INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
    pids_limit = 2048
    sh-4.4#

Here's the file on the server: http://magna012.ceph.redhat.com/bz/bz_1956285/quay-io-ocs-dev-ocs-must-gather-sha256-2dc20d4b2e8fc7000b2c7d7a31646b60f137d777e51868cf1be04ac86fc9bc15/ (try searching for runtime/cgo). The issue was the one I explained above. Moving it to POST; please comment so that I can change states if needed.
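The KCS article linked above describes the supported procedure in full; roughly, it comes down to applying a ContainerRuntimeConfig that raises the CRI-O pids_limit. A minimal sketch follows; the MachineConfigPool selector label and the 2048 value are illustrative assumptions, so follow the article for the exact steps:

```bash
# Sketch only; see https://access.redhat.com/solutions/5597061 for the supported procedure.
# Raises the CRI-O pids_limit for worker nodes. The pool selector label and the 2048 value
# are assumptions for illustration.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: increase-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    pidsLimit: 2048
EOF

# The Machine Config Operator then rolls the change out node by node; verify on a node with:
#   crio config | grep pids_limit
```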
Marking as NEW again, since this issue is fixed with the above patch when using the quay.io/ocs-dev/ocs-must-gather image, but still fails when using the quay.io/rhceph-dev/ocs-must-gather:latest-4.7 image.

@branto can you please shed some light on the behavior of the http://quay.io/rhceph-dev repo and why the image from that repo fails to give the same result as the downstream build? Thanks.

Hi Boris, by the patch I meant the fix that I applied from https://access.redhat.com/solutions/5597061. It increases the PID limit so that the must-gather script can run, but the issue still persists with the downstream image. That is why I asked whether something is happening on your end, since after applying the fix above, the upstream image no longer hits this issue.

Where did you apply the fix? In the Dockerfile? Somewhere else?

In the cluster itself; please take a look at the fix itself: https://access.redhat.com/solutions/5597061

Did you apply it manually? How is this supposed to be automated?

I applied it manually, but if it keeps happening, the platform team will have to automate it or increase the default number of PIDs.

OK, it looks like this requires some I/O to be running to hit this, right? The article describes how to fix it in that case. I presume the I/O takes up some PIDs and we are hitting the default PID limit. There was no code change in the upstream image, right? You only applied the fix manually? I presume we might have been hitting even the increased PID limit with the downstream image then (depending on how much I/O was occurring). Is this reproducible regularly? Should we try to automate it in some way?

(In reply to Boris Ranto from comment #13)
> OK, it looks like this requires some i/o to be run to hit this, right? The
> article describes how to fix this in that case. I presume the i/o takes up
> some pids and we are hitting a default pid limit.
>
> There was no code change in the upstream image right? You only applied the
> fix manually right? I presume we might have been hitting even the increased
> pid limit with the downstream image then (depending on how much i/o was
> occurring).

Yes, there was no code change, but this is occurring only upstream and not downstream. This was filed under must-gather; it runs some tasks in the background. My concern was whether something happening in the upstream repo is causing this change.

> Is this reproducible regularly? Should we try to automate it in some way?

Yes, it's reproducible all the time; just run must-gather from the upstream branch and it will happen again.

Pratik, are you able to apply the workaround suggested by Rajat in https://bugzilla.redhat.com/show_bug.cgi?id=1956285#c4? Also, is this reproducible in 4.8?

Rajat, what is the customer impact if this is present only upstream? Do customers use rhceph-dev to run must-gather downstream? We have a workaround; not urgent enough to fix in 4.8.

After the offline discussion (Elad, Pratik, Parth, Raz, Oded), removing the blocker flag. Here is the summary:
- The workaround is already documented: https://access.redhat.com/solutions/5597061
- Work on a permanent solution, one of or a combination of (a rough sketch of both ideas follows below):
  - decrease the number of commands being fired at once, which will probably cause must-gather to take more time
  - list the failed commands and re-collect them at the end
- This permanent solution we can track with https://bugzilla.redhat.com/show_bug.cgi?id=1956285 for 4.8.z or 4.9.
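A rough sketch of what those two ideas could look like in a gather script. This is not the actual gather_ceph_resources code; MAX_JOBS, collect_cmd, ceph_commands.txt, and the output paths are hypothetical:

```bash
#!/bin/bash
# Sketch only: the file names, function names, and limits here are assumptions,
# not the real gather_ceph_resources implementation.

MAX_JOBS=8          # cap concurrent collections instead of firing everything at once
OUT_DIR=must_gather
mkdir -p "$OUT_DIR"
: > "$OUT_DIR/failed_cmds.txt"

collect_cmd() {
    local cmd="$1"
    # store "ceph osd df tree" as ceph_osd_df_tree, mirroring the naming seen above
    local out="$OUT_DIR/${cmd// /_}"
    # $cmd is intentionally unquoted so the command string is word-split
    if ! timeout 120 $cmd > "$out" 2>&1; then
        echo "$cmd" >> "$OUT_DIR/failed_cmds.txt"
    fi
}

while read -r cmd; do
    # throttle: wait for a free slot before forking another background collection
    while (( $(jobs -rp | wc -l) >= MAX_JOBS )); do
        wait -n
    done
    collect_cmd "$cmd" &
done < ceph_commands.txt
wait

# second pass: re-collect the failed commands serially at the end
cp "$OUT_DIR/failed_cmds.txt" "$OUT_DIR/retry_cmds.txt"
while read -r cmd; do
    collect_cmd "$cmd"
done < "$OUT_DIR/retry_cmds.txt"
```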
Fix should be available in the latest ODF builds.

Bug reproduced: fork system call failed.

SetUp:
Provider: BareMetal
OCP Version: 4.9.0-0.nightly-2021-10-16-173626
ODF Version: 4.9.0-192.ci
LSO Version: local-storage-operator.4.9.0-202110121402

Test Procedure:
1. Run must-gather command:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
[must-gather-hlnss] POD collecting command output for: ceph osd blocked-by
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph osd blacklist ls
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pg dump
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pg stat
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph pool autoscale-status
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: line 125: ps: command not found
[must-gather-hlnss] POD collecting command output for: ceph progress
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: retry: Resource temporarily unavailable
[must-gather-hlnss] POD /usr/bin/gather_ceph_resources: fork: Interrupted system call
2. Check the must-gather directory. Some files do not exist in the mg directory:
['ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem', 'ceph_mds_stat', 'ceph_osd_crush_dump', 'ceph_osd_crush_show-tunables', 'ceph_osd_crush_weight-set_dump', 'ceph_osd_df_tree', 'ceph_osd_utilization']
['ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem_--format_json-pretty', 'ceph_mgr_services_--format_json-pretty', 'ceph_osd_crush_dump_--format_json-pretty', 'ceph_osd_crush_weight-set_dump_--format_json-pretty', 'ceph_osd_crush_weight-set_ls_--format_json-pretty', 'ceph_osd_df_tree_--format_json-pretty', 'ceph_osd_dump_--format_json-pretty', 'ceph_osd_tree_--format_json-pretty']

As discussed here https://chat.google.com/room/AAAAsMRYD8Y/vAhyFFGPVnY, this is not seen with the official image but with the rhceph-dev image. Rewant is trying to find the difference, but it is not a 4.9 blocker. Moving it out; please revert if required.

As this will be available with the next 4.9 downstream build, moving it back to 4.9.

Bug fixed.

SetUp:
Provider: BareMetal
OCP Version: 4.9.0-0.nightly-2021-11-10-215111
ODF Version: 4.9.0-233.ci
LSO Version: local-storage-operator.4.9.0-202111020858

Test Procedure:
1. Run must-gather command:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
2. Check the must-gather log: there is no "fork system call failed" log.
3. Check that all relevant files exist in the must-gather directory.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086
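For reference, a quick way to script QA's verification steps 2 and 3 above. The directory argument and the list of expected files are illustrative assumptions taken from the missing files reported earlier in this bug, not an official check:

```bash
#!/bin/bash
# Sketch of the verification steps above; the expected file names come from the
# lists of missing files reported in this bug and are illustrative only.
MG_DIR="${1:?usage: $0 <must-gather output dir>}"

# Step 2: the gather logs should contain no fork failures
if grep -rE "fork system call failed|fork: retry" "$MG_DIR" >/dev/null 2>&1; then
    echo "fork failures still present in $MG_DIR"
else
    echo "no fork failures found"
fi

# Step 3: the ceph command outputs that were missing earlier should now exist
for f in ceph_mds_stat ceph_osd_crush_dump ceph_osd_df_tree ceph_osd_utilization \
         ceph_osd_dump_--format_json-pretty ceph_osd_tree_--format_json-pretty; do
    if find "$MG_DIR" -name "$f" -print -quit | grep -q .; then
        echo "OK: $f"
    else
        echo "MISSING: $f"
    fi
done
```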